ABSTRACT


This report summarises the results obtained during the contract period. The
first two sections of the report and the three appendices summarise work on the
Berkeley VAX UNIX system. The directions for the new work for the next contract
period are summarised. A major portion of this part of the report gives a
description of the new file system and networking facilities that were
implemented to meet the needs of the ARPA research community. The third and
fourth sections summarise research in human/machine interaction and in expert
database systems. This work was initiated later in the contract period.

The first section describes the basic kernel functions provided to a UNIX
process: process naming and protection, memory management, software interrupts,
object references (descriptors), time and statistics functions, and resource
controls. These facilities, as well as facilities for bootstrap, shutdown and process
accounting, are provided solely by the kernel.

The second section describes the standard system abstractions for files and
file systems, communication, terminal handling, and process control and
debugging. These facilities are implemented by the operating system or by network
server processes. The first of three appendices summarises the system primitives.

The second appendix describes a reimplementation of the UNIX file system.
The reimplementation provides substantially higher throughput rates by using
more flexible allocation policies that allow better locality of reference and that
can be adapted to a wide range of peripheral and processor characteristics. The
new file system clusters data that is sequentially accessed and provides two block
sizes to allow fast access for large files while not wasting large amounts of space
for small files. File access rates of up to ten times faster than the traditional
UNIX file system are experienced. Long needed enhancements to the user
interface are discussed. These include a mechanism to lock files, extensions of the
name space across file systems, the ability to use arbitrary length file names, and
provisions for efficient administrative control of resource usage.

The last appendix gives a detailed description of the internal structure of
the networking facilities. These facilities are based on several central
abstractions that structure the external (user) view of network communication as well as
the internal (system) implementation.

The third section of the report gives a brief summary of the research
initiated in February 1984 under Supplement A to the contract. Most of those
projects in the areas of computer graphics, intelligent computer systems, and
Final Report - 1 - Notation and types

I.  Unix  System  Enhancements 


0.  Notation  and  types 

The notation used to describe system calls is a variant of a C language call, consisting of a
prototype call followed by declaration of parameters and results. An additional keyword result,
not part of the normal C language, is used to indicate which of the declared entities receive
results. As an example, consider the read call, as described in section 2.1:

cc = read(fd, buf, nbytes);
result int cc; int fd; result char *buf; int nbytes;

The first line shows how the read routine is called, with three parameters. As shown on the
second line, cc is an integer and read also returns information in the parameter buf.

Descriptions of all error conditions arising from each system call are not provided here; they
appear in the programmer's manual. In particular, when accessed from the C language, many
calls return a characteristic -1 value when an error occurs, returning the error code in the global
variable errno. Other languages may present errors in different ways.
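The -1 and errno convention described above can be sketched in a few lines of C. This is a minimal illustration that assumes a POSIX-style environment rather than the exact 4.2BSD headers; the helper name close_reports_ebadf is hypothetical and does not appear in the report.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if closing an invalid descriptor
 * fails with -1 and leaves EBADF in errno, 0 otherwise. */
int close_reports_ebadf(void)
{
    errno = 0;
    if (close(-1) != -1)
        return 0;              /* the call unexpectedly succeeded */
    return errno == EBADF;     /* the error code is in the global errno */
}
```

A caller would normally test the return value first and consult errno only after seeing -1, since successful calls are not required to clear it.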

A number of system standard types are defined in the include file <sys/types.h> and used
in the specifications here and in many C programs. These include caddr_t giving a memory
address (typically as a character pointer), off_t giving a file offset (typically as a long integer),
and a set of unsigned types u_char, u_short, u_int and u_long, shorthand names for unsigned
char, unsigned short, etc.


1.  Kernel  primitives 

The facilities available to a UNIX user process are logically divided into two parts: kernel
facilities directly implemented by UNIX code running in the operating system, and system
facilities implemented either by the system, or in cooperation with a server process. The kernel
facilities are described in this section.

The facilities implemented in the kernel are those which define the UNIX virtual machine
which each process runs in. Like many real machines, this virtual machine has memory
management hardware, an interrupt facility, timers and counters. The UNIX virtual machine also allows
access to files and other objects through a set of descriptors. Each descriptor resembles a device
controller, and supports a set of operations. Like devices on real machines, some of which are
internal to the machine and some of which are external, parts of the descriptor machinery are
built-in to the operating system, while other parts are often implemented in server processes on
other machines. The facilities provided through the descriptor machinery are described in section 2.


1.1.  Processes  and  protection 

1.1.1.  Host  and  process  identifiers 

Each UNIX host has associated with it a 32-bit host id, and a host name of up to 255
characters. These are set (by a privileged user) and returned by the calls:

sethostid(hostid)
long hostid;

hostid = gethostid();
result long hostid;

sethostname(name, len)
char *name; int len;

len = gethostname(buf, buflen)
result int len; result char *buf; int buflen;

On each host runs a set of processes. Each process is largely independent of other processes,
having its own protection domain, address space, timers, and an independent set of references to
system or user implemented objects.

Each process in a host is named by an integer called the process id. This number is in the
range 1-50000 and is returned by the getpid routine:

pid = getpid();
result int pid;

On each UNIX host this identifier is guaranteed to be unique; in a multi-host environment, the
(hostid, process id) pairs are guaranteed unique.

1.1.2.  Process  creation  and  termination 

A  new  process  is  created  by  making  a  logical  duplicate  of  an  existing  process: 

pid = fork();
result  int  pid; 

The  fork  call  returns  twice,  once  in  the  parent  process,  where  pid  is  the  process  identifier  of  the 
child,  and  once  in  the  child  process  where  pid  is  0.  The  parent-child  relationship  induces  a 
hierarchical  structure  on  the  set  of  processes  in  the  system. 

A  process  may  terminate  by  executing  an  exit  call: 

exit(status) 
int  status; 

returning  8  bits  of  exit  status  to  its  parent. 

When a child process exits or terminates abnormally, the parent process receives
information about any event which caused termination of the child process. A second call provides a
non-blocking interface and may also be used to retrieve information about resources consumed by
the process during its lifetime.

#include <sys/wait.h>

pid = wait(astatus);
result int pid; result union wait *astatus;

pid = wait3(astatus, options, arusage);
result int pid; result union waitstatus *astatus;
int options; result struct rusage *arusage;
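The fork, exit and wait sequence can be sketched as follows. This is an illustration under stated assumptions, not the report's own code: it uses the present-day POSIX waitpid call and an int status word with the WIFEXITED and WEXITSTATUS macros in place of the union wait interface shown above, and the helper name child_exit_status is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, then
 * return the exit status collected by the parent, or -1 on failure. */
int child_exit_status(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;               /* fork failed */
    if (pid == 0)
        _exit(code);             /* child: fork returned 0 */

    /* parent: fork returned the child's process id */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because only 8 bits of exit status reach the parent, a child that exits with 259 is seen by the parent as having exited with 3.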

A process can overlay itself with the memory image of another process, passing the newly
created process a set of parameters, using the call:

execve(name, argv, envp)
char *name, **argv, **envp;

The  specified  name  must  be  a  file  which  is  in  a  format  recognised  by  the  system,  either  a  binary 
executable  file  or  a  file  which  causes  the  execution  of  a  specified  interpreter  program  to  process  its 
contents. 

1.1.3. User and group ids

Each process in the system has associated with it two user-id's: a real user id and an effective
user id, both non-negative 16 bit integers. Each process has a real accounting group id and an
effective accounting group id and a set of access group id's. The group id's are non-negative 16
bit integers. Each process may be in several different access groups, with the maximum
concurrent number of access groups a system compilation parameter, the constant NGROUPS in the
file <sys/param.h>, guaranteed to be at least 8.

The real and effective user ids associated with a process are returned by:

ruid = getuid();
result int ruid;

euid = geteuid();
result int euid;

the real and effective accounting group ids by:

rgid = getgid();
result int rgid;

egid = getegid();
result int egid;

and the access group id set is returned by a getgroups call:

ngroups = getgroups(gidsetsize, gidset);
result int ngroups; int gidsetsize; result int gidset[gidsetsize];

The user and group id's are assigned at login time using the setreuid, setregid, and setgroups
calls:

setreuid(ruid, euid);
int ruid, euid;

setregid(rgid, egid);
int rgid, egid;

setgroups(gidsetsize, gidset)
int gidsetsize; int gidset[gidsetsize];

The setreuid call sets both the real and effective user-id's, while the setregid call sets both the real
and effective accounting group id's. Unless the caller is the super-user, ruid must be equal to
either the current real or effective user-id, and rgid equal to either the current real or effective
accounting group id. The setgroups call is restricted to the super-user.

1.1.4. Process groups

Each process in the system is also normally associated with a process group. The group of
processes in a process group is sometimes referred to as a job and manipulated by high-level
system software (such as the shell). The current process group of a process is returned by the
getpgrp call:

pgrp = getpgrp(pid);
result int pgrp; int pid;

When a process is in a specific process group it may receive software interrupts affecting the
group, causing the group to suspend or resume execution or to be interrupted or terminated. In
particular, a system terminal has a process group and only processes which are in the process
group of the terminal may read from the terminal, allowing arbitration of terminals among
several different jobs.

The process group associated with a process may be changed by the setpgrp call:

setpgrp(pid, pgrp);
int pid, pgrp;

Newly created processes are assigned process id's distinct from all processes and process groups,
and the same process group as their parent. A normal (unprivileged) process may set its process
group equal to its process id. A privileged process may set the process group of any process to
any value.


1.2. Memory management†


1.2.1. Text, data and stack

Each process begins execution with three logical areas of memory called text, data and
stack. The text area is read-only and shared, while the data and stack areas are private to the
process. Both the data and stack areas may be extended and contracted on program request.

The call

addr = sbrk(incr);
result caddr_t addr; int incr;

changes the size of the data area by incr bytes and returns the new end of the data area, while

addr = sstk(incr);
result caddr_t addr; int incr;

changes the size of the stack area. The stack area is also automatically extended as needed. On
the VAX the text and data areas are adjacent in the P0 region, while the stack section is in the
P1 region, and grows downward.

1.2.2. Mapping pages

The system supports sharing of data between processes by allowing pages to be mapped into
memory. These mapped pages may be shared with other processes or private to the process.
Protection and sharing options are defined in <mman.h> as:


† This section represents the interface planned for later releases of the system. Of the calls
described in this section, only sbrk and getpagesize are included in 4.2BSD.
/* protections are chosen from these bits, or-ed together */
#define PROT_READ 0x4 /* pages can be read */
#define PROT_WRITE 0x2 /* pages can be written */
#define PROT_EXEC 0x1 /* pages can be executed */

/* sharing types; choose either SHARED or PRIVATE */
#define MAP_SHARED 1 /* share changes */
#define MAP_PRIVATE 2 /* changes are private */

The cpu-dependent size of a page is returned by the getpagesize system call:

pagesize = getpagesize();
result int pagesize;


The call

mmap(addr, len, prot, share, fd, pos);
caddr_t addr; int len, prot, share, fd; off_t pos;

causes the pages starting at addr and continuing for len bytes to be mapped from the object
represented by descriptor fd, at absolute position pos. The parameter share specifies whether
modifications made to this mapped copy of the page are to be kept private, or are to be shared
with other references. The parameter prot specifies the accessibility of the mapped pages. The
addr, len, and pos parameters must all be multiples of the pagesize.
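Since addr, len and pos must all be multiples of the page size, a caller typically rounds a requested length up before mapping. The sketch below shows only that arithmetic; the helper name round_to_pages is hypothetical, and the code assumes getpagesize is declared in <unistd.h>, as it is on current BSD and Linux systems.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: round len up to the next multiple of the
 * page size reported by getpagesize. */
long round_to_pages(long len)
{
    long pagesize = getpagesize();

    return ((len + pagesize - 1) / pagesize) * pagesize;
}
```

A length of one byte therefore rounds to one full page, and a length that is already a multiple of the page size is returned unchanged.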


A process can move pages within its own memory by using the mremap call:

mremap(addr, len, prot, share, fromaddr);
caddr_t addr; int len, prot, share; caddr_t fromaddr;

This call maps the pages starting at fromaddr to the address specified by addr.
A mapping can be removed by the call

munmap(addr, len);
caddr_t addr; int len;

This causes further references to these pages to refer to private pages initialised to zero.


1.2.3. Page protection control

A process can control the protection of pages using the call

mprotect(addr, len, prot);
caddr_t addr; int len, prot;

This call changes the specified pages to have protection prot.


1.2.4. Giving and getting advice

A process that has knowledge of its memory behavior may use the madvise call:

madvise(addr, len, behav);
caddr_t addr; int len, behav;

Behav describes expected behavior, as given in <mman.h>:

#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */

Finally, a process may obtain information about whether pages are core resident by using the call

mincore(addr, len, vec)
caddr_t addr; int len; result char *vec;

Here the current core residency of the pages is returned in the character array vec, with a value of
1 meaning that the page is in-core.


1.3. Signals


1.3.1. Overview

The system defines a set of signals that may be delivered to a process. Signal delivery
resembles the occurrence of a hardware interrupt: the signal is blocked from further occurrence,
the current process context is saved, and a new one is built. A process may specify the handler to
which a signal is delivered, or specify that the signal is to be blocked or ignored. A process may
also specify that a default action is to be taken when signals occur.

Some signals will cause a process to exit when they are not caught. This may be
accompanied by creation of a core image file, containing the current memory image of the process for
use in post-mortem debugging. A process may choose to have signals delivered on a special stack,
so that sophisticated software stack manipulations are possible.

All signals have the same priority. If multiple signals are pending simultaneously, the order
in which they are delivered to a process is implementation specific. Signal routines execute with
the signal that caused their invocation blocked, but other signals may yet occur. Mechanisms are
provided whereby critical sections of code may protect themselves against the occurrence of
specified signals.

1.3.2. Signal types

The signals defined by the system fall into one of five classes: hardware conditions, software
conditions, input/output notification, process control, or resource control. The set of signals is
defined in the file <signal.h>.

Hardware signals are derived from exceptional conditions which may occur during execution.
Such signals include SIGFPE representing floating point and other arithmetic exceptions, SIGILL
for illegal instruction execution, SIGSEGV for addresses outside the currently assigned area of
memory, and SIGBUS for accesses that violate memory protection constraints. Other, more
cpu-specific hardware signals exist, such as those for the various customer-reserved instructions on the
VAX (SIGIOT, SIGEMT, and SIGTRAP).

Software signals reflect interrupts generated by user request: SIGINT for the normal
interrupt signal; SIGQUIT for the more powerful quit signal, that normally causes a core image to be
generated; SIGHUP and SIGTERM that cause graceful process termination, either because a user
has "hung up", or by user or program request; and SIGKILL, a more powerful termination signal
which a process cannot catch or ignore. Other software signals (SIGALRM, SIGVTALRM,
SIGPROF) indicate the expiration of interval timers.

A process can request notification via a SIGIO signal when input or output is possible on a
descriptor, or when a non-blocking operation completes. A process may request to receive a
SIGURG signal when an urgent condition arises.

A process may be stopped by a signal sent to it or the members of its process group. The
SIGSTOP signal is a powerful stop signal, because it cannot be caught. Other stop signals
SIGTSTP, SIGTTIN, and SIGTTOU are used when a user request, input request, or output
request respectively is the reason the process is being stopped. A SIGCONT signal is sent to a
process when it is continued from a stopped state. Processes may receive notification with a
SIGCHLD signal when a child process changes state, either by stopping or by terminating.

Exceeding resource limits may cause signals to be generated. SIGXCPU occurs when a
process nears its CPU time limit and SIGXFSZ warns that the limit on file size creation has been
reached.

1.3.3. Signal handlers

A process has a handler associated with each signal that controls the way the signal is
delivered. The call

#include <signal.h>

struct sigvec {
	int (*sv_handler)();
	int sv_mask;
	int sv_onstack;
};

sigvec(signo, sv, osv)
int signo; struct sigvec *sv; result struct sigvec *osv;

assigns interrupt handler address sv_handler to signal signo. Each handler address specifies either
an interrupt routine for the signal, that the signal is to be ignored, or that a default action
(usually process termination) is to occur if the signal occurs. The constants SIG_IGN and SIG_DFL
used as values for sv_handler cause ignoring or defaulting of a condition. The sv_mask and
sv_onstack values specify the signal mask to be used when the handler is invoked and whether the
handler should operate on the normal run-time stack or a special signal stack (see below). If osv
is non-zero, the previous signal vector is returned.

When a signal condition arises for a process, the signal is added to a set of signals pending
for the process. If the signal is not currently blocked by the process then it will be delivered. The
process of signal delivery adds the signal to be delivered and those signals specified in the
associated signal handler's sv_mask to a set of those masked for the process, saves the current process
context, and places the process in the context of the signal handling routine. The call is arranged
so that if the signal handling routine exits normally the signal mask will be restored and the
process will resume execution in the original context. If the process wishes to resume in a different
context, then it must arrange to restore the signal mask itself.

The mask of blocked signals is independent of handlers for signals. It prevents signals from
being delivered much as a raised hardware interrupt priority level prevents hardware interrupts.
Preventing an interrupt from occurring by changing the handler is analogous to disabling a device
from further interrupts.

The signal handling routine sv_handler is called by a C call of the form

(*sv_handler)(signo, code, scp);
int signo; long code; struct sigcontext *scp;

The signo gives the number of the signal that occurred, and the code, a word of information
supplied by the hardware. The scp parameter is a pointer to a machine-dependent structure
containing the information for restoring the context before the signal.

1.3.4. Sending signals

A  process  can  send  a  signal  to  another  process  or  group  of  processes  with  the  calls: 

kill(pid,  signo) 
int  pid,  signo; 

killpgrp(pgrp,  signo) 
int  pgrp,  signo; 
long mask,


1.3.6. Signal stacks

Applications that maintain complex or fixed size stacks can use the call

struct sigstack {
	caddr_t ss_sp;
	int ss_onstack;
#include <sys/time.h>

settimeofday(tp, tzp);
struct timeval *tp;
struct timezone *tzp;

gettimeofday(tp, tzp);
result struct timeval *tp;
result struct timezone *tzp;

where the structures are defined in <sys/time.h> as:

struct timeval {
	long tv_sec; /* seconds since Jan 1, 1970 */
	long tv_usec; /* and microseconds */
};

struct timezone {
	int tz_minuteswest; /* of Greenwich */
	int tz_dsttime; /* type of dst correction to apply */
};

Earlier versions of UNIX contained only a 1-second resolution version of this call, which remains
as a library routine:

time(tvsec)
result long *tvsec;

returning only the tv_sec field from the gettimeofday call.
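The two fields of struct timeval combine into a single microsecond count as in the following sketch. It assumes a POSIX-style gettimeofday and passes a null timezone pointer; the helper name now_usec is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: current time in microseconds since Jan 1, 1970,
 * or -1 if gettimeofday fails. */
long long now_usec(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) < 0)   /* timezone result not requested */
        return -1;
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}
```

Dividing the result by 1000000 recovers the value the older time routine would return.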

1.4.2.  Interval  time 

The system provides each process with three interval timers, defined in <sys/time.h>:

#define ITIMER_REAL 0 /* real time intervals */
#define ITIMER_VIRTUAL 1 /* virtual time intervals */
#define ITIMER_PROF 2 /* user and system virtual time */

The ITIMER_REAL timer decrements in real time. It could be used by a library routine to
maintain a wakeup service queue. A SIGALRM signal is delivered when this timer expires.

The ITIMER_VIRTUAL timer decrements in process virtual time. It runs only when the
process is executing. A SIGVTALRM signal is delivered when it expires.

The ITIMER_PROF timer decrements both in process virtual time and when the system is
running on behalf of the process. It is designed to be used by processes to statistically profile
their execution. A SIGPROF signal is delivered when it expires.

A timer value is defined by the itimerval structure:

struct itimerval {
	struct timeval it_interval; /* timer interval */
	struct timeval it_value; /* current value */
};

and  a  timer  is  set  or  read  by  the  call: 
getitimer(which, value);
int which; result struct itimerval *value;

setitimer(which, value, ovalue);
int which; struct itimerval *value; result struct itimerval *ovalue;

The  third  argument  to  setitimer  specifies  an  optional  structure  to  receive  the  previous  contents  of 
the  interval  timer.  A  timer  can  be  disabled  by  specifying  a  timer  value  of  0. 

The system rounds argument timer intervals to be not less than the resolution of its clock.
This clock resolution can be determined by loading a very small value into a timer and reading
the timer back to see what value resulted.

The alarm system call of earlier versions of UNIX is provided as a library routine using the
ITIMER_REAL timer. The process profiling facilities of earlier versions of UNIX remain because
it is not always possible to guarantee the automatic restart of system calls after receipt of a
signal.


profil(buf, bufsize, offset, scale);
result char *buf; int bufsize, offset, scale;


1.5.  Descriptors 


1.5.1.  The  reference  table 

Each process has access to resources through descriptors. Each descriptor is a handle
allowing the process to reference objects such as files, devices and communications links.

Rather than allowing processes direct access to descriptors, the system introduces a level of
indirection, so that descriptors may be shared between processes. Each process has a descriptor
reference table, containing pointers to the actual descriptors. The descriptors themselves thus
have multiple references, and are reference counted by the system.

Each process has a fixed size descriptor reference table, where the size is returned by the
getdtablesize call:

nds = getdtablesize();
result int nds;

and guaranteed to be at least 20. The entries in the descriptor reference table are referred to by
small integers; for example if there are 20 slots they are numbered 0 to 19.

1.5.2.  Descriptor  properties 

Each descriptor has a logical set of properties maintained by the system and defined by its
type. Each type supports a set of operations; some operations, such as reading and writing, are
common to several abstractions, while others are unique. The generic operations applying to
many of these types are described in section 2.1. Naming contexts, files and directories are
described in section 2.2. Section 2.3 describes communications domains and sockets. Terminals
and (structured and unstructured) devices are described in section 2.4.

1.5.3. Managing descriptor references

A duplicate of a descriptor reference may be made by doing

new = dup(old);
result int new; int old;

returning a copy of descriptor reference old indistinguishable from the original. The new chosen
by the system will be the smallest unused descriptor reference slot. A copy of a descriptor
reference may be made in a specific slot by doing

dup2(old, new);
int old, new;

The dup2 call causes the system to deallocate the descriptor reference currently occupying slot new,
if any, replacing it with a reference to the same descriptor as old. This deallocation is also
performed by:

close(old);
int old;
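That a duplicated reference names the same underlying descriptor can be sketched with a pipe. The pipe call is an assumption here, since it is not described in this section, and the helper name dup_shares_descriptor is hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: write one byte through a duplicate of a pipe's
 * write end after closing the original reference, then read the byte
 * back. Returns the byte read, or -1 on failure. */
int dup_shares_descriptor(void)
{
    int p[2], copy, result = -1;
    char out = 'x', in = 0;

    if (pipe(p) < 0)
        return -1;
    copy = dup(p[1]);          /* second reference to the same descriptor */
    if (copy < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    close(p[1]);               /* drop the original reference; the
                                  descriptor stays open through copy */
    if (write(copy, &out, 1) == 1 && read(p[0], &in, 1) == 1)
        result = in;
    close(copy);
    close(p[0]);
    return result;
}
```

Closing the original slot only removes one reference; the descriptor itself is released when the last reference to it is closed.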

1.5.4. Multiplexing requests

The system provides a standard way to do synchronous and asynchronous multiplexing of
operations.

Synchronous multiplexing is performed by using the select call:

nds = select(nd, in, out, except, tvp);
result int nds; int nd; result *in, *out, *except;
struct timeval *tvp;

The select call examines the descriptors specified by the sets in, out and except, replacing the
specified bit masks by the subsets that select for input, output, and exceptional conditions
respectively (nd indicates the size, in bytes, of the bit masks). If any descriptors meet the following
criteria, then the number of such descriptors is returned in nds and the bit masks are updated.

•  A descriptor selects for input if an input oriented operation such as read or receive is
possible, or if a connection request may be accepted (see section 2.3.1.4).

•  A descriptor selects for output if an output oriented operation such as write or send is
possible, or if an operation that was "in progress", such as connection establishment, has
completed (see section 2.1.3).

•  A descriptor selects for an exceptional condition if a condition that would cause a SIGURG
signal to be generated exists (see section 1.3.2).

If none of the specified conditions is true, the operation blocks for at most the amount of time
specified by tvp, or waits for one of the conditions to arise if tvp is given as 0.
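Selecting for input on a single descriptor can be sketched as below. The code assumes the present-day POSIX form of select, in which the first argument is one more than the highest descriptor number and the sets are fd_set objects handled with FD_ZERO and FD_SET, rather than the byte count and raw bit masks described above. The pipe call and the names readable_within and select_demo are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: number of descriptors (0 or 1) that select for
 * input on fd within timeout_ms milliseconds, or -1 on error. */
int readable_within(int fd, int timeout_ms)
{
    fd_set in;
    struct timeval tv;

    FD_ZERO(&in);
    FD_SET(fd, &in);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    /* A zero-valued timeval polls; a null pointer would wait forever. */
    return select(fd + 1, &in, NULL, NULL, &tv);
}

/* Hypothetical demonstration: the read end of a pipe selects for input
 * only once data has been written. Returns 1 if that is observed. */
int select_demo(void)
{
    int p[2], before, after = -1;

    if (pipe(p) < 0)
        return -1;
    before = readable_within(p[0], 0);     /* nothing written yet */
    if (write(p[1], "x", 1) == 1)
        after = readable_within(p[0], 0);  /* one byte is pending */
    close(p[0]);
    close(p[1]);
    return before == 0 && after == 1;
}
```

The same pattern extends to output and exceptional conditions by passing sets in the second and third set arguments.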

Options  affecting  i/o  on  a  descriptor  may  be  read  and  set  by  the  call: 

dopt = fcntl(d, cmd, arg)
result int dopt; int d, cmd, arg;

/* interesting values for cmd */
#define F_SETFL     3   /* set descriptor options */
#define F_GETFL     4   /* get descriptor options */
#define F_SETOWN    5   /* set descriptor owner (pid/pgrp) */
#define F_GETOWN    6   /* get descriptor owner (pid/pgrp) */

The F_SETFL cmd may be used to set a descriptor in non-blocking i/o mode and/or enable
signalling when i/o is possible. F_SETOWN may be used to specify a process or process group to be
signalled when using the latter mode of operation.

Operations on non-blocking descriptors will either complete immediately, note an error
EWOULDBLOCK, partially complete an input or output operation returning a partial count, or
return an error EINPROGRESS noting that the requested operation is in progress. A descriptor
which has signalling enabled will cause the specified process and/or process group to be signaled,
with a SIGIO for input, output, or in-progress operation complete, or a SIGURG for exceptional
conditions.

For example, when writing to a terminal using non-blocking output, the system will accept
only as much data as there is buffer space for and return; when making a connection on a socket,
the operation may return indicating that the connection establishment is "in progress". The
select facility can be used to determine when further output is possible on the terminal, or when
the connection establishment attempt is complete.
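
As a hedged illustration of non-blocking mode, the sketch below sets the option with F_GETFL and
F_SETFL and then distinguishes "would block" from other results of a read. The flag name
O_NONBLOCK is the POSIX spelling and is an assumption here, since this document does not name the
flag; set_nonblocking and try_read are hypothetical helper names.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd in non-blocking mode, preserving its other options. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try to read without blocking.
 * Returns the byte count, 0 at end of file, -1 on error, or -2 if the
 * read would have blocked (EWOULDBLOCK, or EAGAIN on systems that use it).
 */
ssize_t try_read(int fd, char *buf, size_t nbytes)
{
    ssize_t cc = read(fd, buf, nbytes);
    if (cc < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
        return -2;
    return cc;
}
```

A caller that receives -2 would normally select on the descriptor before trying again.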

1.5.5. Descriptor wrapping†

A user process may build descriptors of a specified type by wrapping a communications
channel with a system supplied protocol translator:

new = wrap(old, proto)
result int new; int old; struct dprop *proto;

Operations on the descriptor old are then translated by the system provided protocol translator
into requests on the underlying object old in a way defined by the protocol. The protocols
supported by the kernel may vary from system to system and are described in the programmers
manual.

Protocols may be based on communications multiplexing or a rights-passing style of
handling multiple requests made on the same object. For instance, a protocol for implementing a
file abstraction may or may not include locally generated "read-ahead" requests. A protocol that
provides for read-ahead may provide higher performance but have a more difficult
implementation.

Another  example  is  the  terminal  driving  facilities.  Normally  a  terminal  is  associated  with  a 
communications  line  and  the  terminal  type  and  standard  terminal  access  protocol  is  wrapped 
around  a  synchronous  communications  line  and  given  to  the  user.  If  a  virtual  terminal  is 
required,  the  terminal  driver  can  be  wrapped  around  a  communications  link,  the  other  end  of 
which  is  held  by  a  virtual  terminal  protocol  interpreter. 


1.6.  Resource  controls 
1.6.1.  Process  priorities 

The system gives CPU scheduling priority to processes that have not used CPU time
recently. This tends to favor interactive processes and processes that execute only for short
periods. It is possible to determine the priority currently assigned to a process, process group, or
the processes of a specified user, or to alter this priority using the calls:

#define PRIO_PROCESS    0   /* process */
#define PRIO_PGRP       1   /* process group */
#define PRIO_USER       2   /* user id */

prio = getpriority(which, who);
result int prio; int which, who;

setpriority(which, who, prio);
int which, who, prio;

The value prio is in the range -20 to 20. The default priority is 0; lower priorities cause more
favorable execution. The getpriority call returns the highest priority (lowest numerical value)
enjoyed by any of the specified processes. The setpriority call sets the priorities of all of the
specified processes to the specified value. Only the super-user may lower priorities.
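
A minimal sketch of the two calls follows, assuming the present-day declarations in
<sys/resource.h>. It raises the numerical priority of the calling process, which does not require
privilege. The helper name be_nicer, the error value -100, and the cap at 19 (common on current
systems, narrower than the range quoted above) are assumptions made for illustration.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/time.h>

/*
 * Make the calling process less favored by adding delta (>= 0) to its
 * priority value. Returns the resulting priority, or -100 on error.
 */
int be_nicer(int delta)
{
    errno = 0;                      /* -1 is a legitimate priority value */
    int prio = getpriority(PRIO_PROCESS, 0);
    if (prio == -1 && errno != 0)
        return -100;

    int target = prio + delta;
    if (target > 19)
        target = 19;                /* assumed upper bound on current systems */
    if (setpriority(PRIO_PROCESS, 0, target) < 0)
        return -100;

    errno = 0;
    prio = getpriority(PRIO_PROCESS, 0);
    return (prio == -1 && errno != 0) ? -100 : prio;
}
```

A who value of 0 denotes the calling process in the POSIX interface.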

† The facilities described in this section are not included in 4.2BSD.

#define RLIMIT_CPU      0   /* cpu time in milliseconds */
#define RLIMIT_FSIZE    1   /* maximum file size */
#define RLIMIT_DATA     2   /* maximum data segment size */
#define RLIMIT_STACK    3   /* maximum stack segment size */
#define RLIMIT_CORE     4   /* maximum core file size */
#define RLIMIT_RSS      5   /* maximum resident set size */

#define RLIM_NLIMITS    6

#define RLIM_INFINITY   0x7fffffff

struct rlimit {
        int     rlim_cur;       /* current (soft) limit */
        int     rlim_max;       /* hard limit */
};

getrlimit(resource, rlp)
int resource; result struct rlimit *rlp;

setrlimit(resource, rlp)
int resource; struct rlimit *rlp;

Only the super-user can raise the maximum limits. Other users may only alter rlim_cur
within the range from 0 to rlim_max or (irreversibly) lower rlim_max.


1.7. System operation support

Unless noted otherwise, the calls in this section are permitted only to a privileged user.

1.7.1. Bootstrap operations

The call

mount(blkdev, dir, ronly);
char *blkdev, *dir; int ronly;

extends the UNIX name space. The mount call specifies a block device blkdev containing a UNIX
file system to be made available starting at dir. If ronly is set then the file system is read-only;
writes to the file system will not be permitted and access times will not be updated when files are
referenced. Dir is normally a name in the root directory.

The call

swapon(blkdev, size);
char *blkdev; int size;

specifies a device to be made available for paging and swapping.

1.7.2. Shutdown operations

The call

unmount(dir);
char *dir;

unmounts the file system mounted on dir. This call will succeed only if the file system is not
currently being used.

The call

sync();

schedules input/output to clean all system buffer caches. (This call does not require privileged
status.)

The call

reboot(how)
int how;

causes a machine halt or reboot. The call may request a reboot by specifying how as
RB_AUTOBOOT, or that the machine be halted with RB_HALT. These constants are defined in
<sys/reboot.h>.

1.7.3. Accounting

The system optionally keeps an accounting record in a file for each process that exits on the
system. The format of this record is beyond the scope of this document. The accounting may be
enabled to a file name by doing

acct(path);
char *path;

If path is null, then accounting is disabled. Otherwise, the named file becomes the accounting file.

2.  System  facilities 

This  section  discusses  the  system  facilities  that  are  not  considered  part  of  the  kernel. 

The  system  abstractions  described  are: 

Directory  contexts 

A  directory  context  is  a  position  in  the  UNIX  file  system  name  space.  Operations  on  files 
and  other  named  objects  in  a  file  system  are  always  specified  relative  to  such  a  context. 

Files 

Files are used to store uninterpreted sequences of bytes on which random access reads and
writes may occur. Pages from files may also be mapped into process address space. A
directory may be read as a file.†

Communications  domains 

A communications domain represents an interprocess communications environment, such as
the communications facilities of the UNIX system, communications in the INTERNET, or
the resource sharing protocols and access rights of a resource sharing system on a local
network.

Sockets 

A socket is an endpoint of communication and the focal point for IPC in a communications
domain. Sockets may be created in pairs, or given names and used to rendezvous with other
sockets in a communications domain, accepting connections from these sockets or
exchanging messages with them. These operations model a labeled or unlabeled communications
graph, and can be used in a wide variety of communications domains. Sockets can have
different types to provide different semantics of communication, increasing the flexibility of
the model.

Terminals  and  other  devices 

Devices include terminals, providing input editing and interrupt generation and output flow
control and editing, magnetic tapes, disks and other peripherals. They often support the
generic read and write operations as well as a number of ioctls.

† Support for mapping files is not included in the 4.2 release.

Processes 

Process  descriptors  provide  facilities  for  control  and  debugging  of  other  processes. 


2.1. Generic operations

Many system abstractions support the operations read, write and ioctl. We describe the
basics of these common primitives here. Similarly, the mechanisms whereby normally
synchronous operations may occur in a non-blocking or asynchronous fashion are common to all
system-defined abstractions and are described here.

2.1.1.  Read  and  write 

The  read  and  write  system  calls  can  be  applied  to  communications  channels,  files,  terminals 
and  devices.  They  have  the  form: 

cc = read(fd, buf, nbytes);
result int cc; int fd; result caddr_t buf; int nbytes;

cc = write(fd, buf, nbytes);
result int cc; int fd; caddr_t buf; int nbytes;

The read call transfers as much data as possible from the object defined by fd to the buffer at
address buf of size nbytes. The number of bytes transferred is returned in cc, which is -1 if a
return occurred before any data was transferred because of an error or use of non-blocking
operations.

The write call transfers data from the buffer to the object defined by fd. Depending on the
type of fd, it is possible that the write call will accept only some portion of the provided bytes;
the user should resubmit the other bytes in a later request in this case. Error returns because of
interrupted or otherwise incomplete operations are possible.
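
The resubmission of unaccepted bytes can be sketched as a loop. This is an illustrative helper
under the POSIX declarations of write (ssize_t, size_t), not part of the interface described here;
the name write_all is hypothetical, and the sketch assumes a blocking descriptor.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Write all nbytes from buf to fd, resubmitting the remainder whenever
 * write accepts only part of the data. Returns 0 on success, -1 on error.
 */
int write_all(int fd, const char *buf, size_t nbytes)
{
    while (nbytes > 0) {
        ssize_t cc = write(fd, buf, nbytes);
        if (cc < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any transfer: retry */
            return -1;
        }
        buf += cc;              /* skip the bytes that were accepted */
        nbytes -= (size_t)cc;
    }
    return 0;
}
```

On a non-blocking descriptor the loop would also need to wait, for example with select, when the
write reports that it would block.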

Scattering of data on input or gathering of data for output is also possible using an array of
input/output vector descriptors. The type for the descriptors is defined in <sys/uio.h> as:

struct iovec {
        caddr_t iov_msg;        /* base of a component */
        int     iov_len;        /* length of a component */
};

The calls using an array of descriptors are:

cc = readv(fd, iov, iovlen);
result int cc; int fd; struct iovec *iov; int iovlen;

cc = writev(fd, iov, iovlen);
result int cc; int fd; struct iovec *iov; int iovlen;

Here iovlen is the count of elements in the iov array.

2.1.2. Input/output control

Control operations on an object are performed by the ioctl operation:

ioctl(fd, request, buffer);
int fd, request; caddr_t buffer;

This operation causes the specified request to be performed on the object fd. The request
parameter specifies whether the argument buffer is to be read, written, read and written, or is not
needed, and also the size of the buffer, as well as the request. Different descriptor types and
subtypes within descriptor types may use distinct ioctl requests. For example, operations on
terminals control flushing of input and output queues and setting of terminal parameters;
operations on disks cause formatting operations to occur; operations on tapes control tape
positioning.

The  names  for  basic  control  operations  are  defined  in  <sys/ioctl.h>. 
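
One request can serve as a small illustration. The sketch below assumes that FIONREAD, which
reports the number of bytes available for reading, is defined by <sys/ioctl.h> and is supported
by the descriptor type in use; both hold on many current UNIX-like systems but are not stated in
this document. The helper name bytes_pending is hypothetical.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Return the number of bytes waiting to be read on fd, or -1 on error. */
int bytes_pending(int fd)
{
    int n = 0;

    /* The buffer argument is written by the system for this request. */
    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```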

2.1.3. Non-blocking and asynchronous operations

A process that wishes to do non-blocking operations on one of its descriptors sets the
descriptor in non-blocking mode as described in section 1.5.4. Thereafter the read call will return
a specific EWOULDBLOCK error indication if there is no data to be read. The process may
select the associated descriptor to determine when a read is possible.

Output attempted when a descriptor can accept less than is requested will either accept
some of the provided data, returning a shorter than normal length, or return an error indicating
that the operation would block. More output can be performed as soon as a select call indicates
the object is writeable.

Operations other than data input or output may be performed on a descriptor in a
non-blocking fashion. These operations will return with a characteristic error indicating that they
are in progress if they cannot return immediately. The descriptor may then be selected for write to
find out when the operation can be retried. When select indicates the descriptor is writeable, a
respecification of the original operation will return the result of the operation.


2.2. File system

2.2.1.  Overview 

The file system abstraction provides access to a hierarchical file system structure. The file
system contains directories (each of which may contain other sub-directories) as well as files and
references to other objects such as devices and inter-process communications sockets.

Each file is organized as a linear array of bytes. No record boundaries or system related
information is present in a file. Files may be read and written in a random-access fashion. The
user may read the data in a directory as though it were an ordinary file to determine the names of
the contained files, but only the system may write into the directories. The file system stores only
a small amount of ownership, protection and usage information with a file.

2.2.2.  Naming 

The file system calls take path name arguments. These consist of zero or more component
file names separated by "/" characters, where each file name is up to 255 ASCII characters
excluding null and "/".

Each process always has two naming contexts: one for the root directory of the file system
and one for the current working directory. These are used by the system in the filename
translation process. If a path name begins with a "/", it is called a full path name and interpreted
relative to the root directory context. If the path name does not begin with a "/" it is called a
relative path name and interpreted relative to the current directory context.

The system limits the total length of a path name to 1024 characters.

The file name ".." in each directory refers to the parent directory of that directory. The
parent directory of a file system is always the system's root directory.

The calls

chdir(path);
char *path;

chroot(path)
char *path;

change the current working directory and root directory context of a process. Only the super-user
can change the root directory context of a process.

2.2.3. Creation and removal

The file system allows directories, files, special devices, and "portals" to be created and
removed from the file system.

2.2.3.1. Directory creation and removal

A directory is created with the mkdir system call:

mkdir(path, mode);
char *path; int mode;

and removed with the rmdir system call:

rmdir(path);
char *path;

A directory must be empty if it is to be deleted.
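
The requirement that a directory be empty before removal can be demonstrated with a short
sketch. It assumes the POSIX forms of mkdir, rmdir, open and unlink; the helper name
mkdir_rmdir_demo and the file name "entry" are hypothetical, and the error codes checked
(ENOTEMPTY or EEXIST) are those used by current systems rather than anything stated here.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create directory dir, confirm that rmdir refuses while it holds a file,
 * then empty and remove it. Returns 0 if every step behaved as described.
 */
int mkdir_rmdir_demo(const char *dir)
{
    char file[1024];

    if (mkdir(dir, 0755) < 0)
        return -1;
    snprintf(file, sizeof file, "%s/entry", dir);
    int fd = open(file, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* The directory is not empty, so removal must fail. */
    if (rmdir(dir) == 0)
        return -1;
    if (errno != ENOTEMPTY && errno != EEXIST)
        return -1;

    if (unlink(file) < 0)
        return -1;
    return rmdir(dir);      /* now empty: removal succeeds */
}
```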

2.2.3.2. File creation

Files are created with the open system call,

fd = open(path, oflag, mode);
result int fd; char *path; int oflag, mode;

The path parameter specifies the name of the file to be created. The oflag parameter must include
O_CREAT from below to cause the file to be created. The protection for the new file is specified
in mode. Bits for oflag are defined in <sys/file.h>:

#define O_RDONLY    000     /* open for reading */
#define O_WRONLY    001     /* open for writing */
#define O_RDWR      002     /* open for read & write */
#define O_NDELAY    004     /* non-blocking open */
#define O_APPEND    010     /* append on each write */
#define O_CREAT     01000   /* open with file create */
#define O_TRUNC     02000   /* open with truncation */
#define O_EXCL      04000   /* error on create if file exists */

One of O_RDONLY, O_WRONLY and O_RDWR should be specified, indicating what
types of operations are desired to be performed on the open file. The operations will be checked
against the user's access rights to the file before allowing the open to succeed. Specifying
O_APPEND causes writes to automatically append to the file. The flag O_CREAT causes the file
to be created if it does not exist, with the specified mode, owned by the current user and the
group of the containing directory.

If the open specifies to create the file with O_EXCL and the file already exists, then the
open will fail without affecting the file in any way. This provides a simple exclusive access
facility.
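
The exclusive access facility is commonly used to build "lock files". The sketch below is one
such use under the POSIX declaration of open; the helper names acquire_lockfile and
release_lockfile and the return conventions are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to take an exclusive lock file at path.
 * Returns a descriptor on success, -1 if the file already exists
 * (another process holds it), or -2 on any other error.
 */
int acquire_lockfile(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd >= 0)
        return fd;
    return errno == EEXIST ? -1 : -2;
}

/* Release the lock by closing the descriptor and removing the name. */
int release_lockfile(const char *path, int fd)
{
    close(fd);
    return unlink(path);
}
```

Because the existence test and the creation happen in a single call, two processes cannot both
succeed in creating the same name.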

2.2.3.3. Creating references to devices

The file system allows entries which reference peripheral devices. Peripherals are
distinguished as block or character devices according to their ability to support block-oriented
operations. Devices are identified by their "major" and "minor" device numbers. The major device
number determines the kind of peripheral it is, while the minor device number indicates one of
possibly many peripherals of that kind. Structured devices have all operations performed
internally in "block" quantities while unstructured devices often have a number of special ioctl
operations, and may have input and output performed in large units. The mknod call creates
special entries:

mknod(path, mode, dev);
char *path; int mode, dev;

where mode is formed from the object type and access permissions. The parameter dev is a
configuration dependent parameter used to identify specific character or block i/o devices.


2.2.3.4. Portal creation†

The call

fd = portal(name, server, param, dtype, protocol, domain, socktype)
result int fd; char *name, *server, *param; int dtype, protocol;
int domain, socktype;

places a name in the file system name space that causes connection to a server process when the
name is used. The portal call returns an active portal in fd as though an access had occurred to
activate an inactive portal, as now described.

When an inactive portal is accessed, the system sets up a socket of the specified socktype in
the specified communications domain (see section 2.3), and creates the server process, giving it the
specified param as argument to help it identify the portal, and also giving it the newly created
socket as descriptor number 0. The accessor of the portal will create a socket in the same domain
and connect to the server. The user will then wrap the socket in the specified protocol to create
an object of the required descriptor type dtype and proceed with the operation which was in
progress before the portal was encountered.

While the server process holds the socket (which it received as fd from the portal call on
descriptor 0 at activation) further references will result in connections being made to the same
socket.


2.2.3.5. File, device, and portal removal

A reference to a file, special device or portal may be removed with the unlink call,

unlink(path);
char *path;

The caller must have write access to the directory in which the file is located for this call to be
successful.

2.2.4.  Reading  and  modifying  file  attributes 

Detailed information about the attributes of a file may be obtained with the calls:


† The portal call is not implemented in 4.2BSD.


—  December  1985  — 


Final Report



#include <sys/stat.h>

stat(path, stb);
char *path; result struct stat *stb;

fstat(fd, stb);
int fd; result struct stat *stb;

The stat structure includes the file type, protection, ownership, access times, size, and a count of
hard links. If the file is a symbolic link, then the status of the link itself (rather than the file the
link references) may be found using the lstat call:

lstat(path, stb);
char *path; result struct stat *stb;

A newly created file is assigned the user id of the process that created it and the group id of
the directory in which it was created. The ownership of a file may be changed by either of the
calls

chown(path, owner, group);
char *path; int owner, group;

fchown(fd, owner, group);
int fd, owner, group;

In addition to ownership, each file has three levels of access protection associated with it.
These levels are owner relative, group relative, and global (all users and groups). Each level of
access has separate indicators for read permission, write permission, and execute permission. The
protection bits associated with a file may be set by either of the calls:

chmod(path, mode);
char *path; int mode;

fchmod(fd, mode);
int fd, mode;

where mode is a value indicating the new protection of the file. The file mode is a three digit
octal number. Each digit encodes read access as 4, write access as 2 and execute access as 1,
or'ed together. The 0700 bits describe owner access, the 070 bits describe the access rights for
processes in the same group as the file, and the 07 bits describe the access rights for other
processes.

Finally, the access and modify times on a file may be set by the call:

utimes(path, tvp)
char *path; struct timeval *tvp[2];

This is particularly useful when moving files between media, to preserve relationships between the
times the file was modified.

2.2.5. Links and renaming

Links allow multiple names for a file to exist. Links exist independently of the file linked to.

Two types of links exist, hard links and symbolic links. A hard link is a reference counting
mechanism that allows a file to have multiple names within the same file system. Symbolic links
cause string substitution during the pathname interpretation process.

Hard links and symbolic links have different properties. A hard link insures the target file
will always be accessible, even after its original directory entry is removed; no such guarantee
exists for a symbolic link. Symbolic links can span file system boundaries.
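
The difference in guarantees can be observed directly. The following sketch assumes the POSIX
calls link, symlink, stat and lstat; the helper name link_demo and the entry names "orig",
"hard" and "soft" are hypothetical.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * In directory dir (which must exist and be writable), create a file with
 * one hard link and one symbolic link, then remove the original name.
 * Returns 1 if the hard link still resolves while the symbolic link no
 * longer does, 0 if not, and -1 on an unexpected error.
 */
int link_demo(const char *dir)
{
    char orig[1024], hard[1024], soft[1024];
    struct stat stb;

    snprintf(orig, sizeof orig, "%s/orig", dir);
    snprintf(hard, sizeof hard, "%s/hard", dir);
    snprintf(soft, sizeof soft, "%s/soft", dir);

    int fd = open(orig, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    close(fd);
    if (link(orig, hard) < 0 || symlink(orig, soft) < 0)
        return -1;

    if (unlink(orig) < 0)                       /* drop the original entry */
        return -1;

    int hard_ok = stat(hard, &stb) == 0;        /* file is still reachable */
    int soft_ok = stat(soft, &stb) == 0;        /* follows the link: dangling */
    int link_exists = lstat(soft, &stb) == 0;   /* the link itself remains */

    unlink(hard);
    unlink(soft);
    return (hard_ok && !soft_ok && link_exists) ? 1 : 0;
}
```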

accessible = access(path, how);
result int accessible; char *path; int how;

Here how is constructed by or'ing the following bits, defined in <sys/file.h>:

#define F_OK    0   /* file exists */
#define X_OK    1   /* file is executable */
#define W_OK    2   /* file is writable */
#define R_OK    4   /* file is readable */

The presence or absence of advisory locks does not affect the result of access.

2.2.8.  Locking 

The file system provides basic facilities that allow cooperating processes to synchronize their
access to shared files. A process may place an advisory read or write lock on a file, so that other
cooperating processes may avoid interfering with the process' access. This simple mechanism
provides locking with file granularity. More granular locking can be built using the IPC facilities to
provide a lock manager. The system does not force processes to obey the locks; they are of an
advisory nature only.

Locking is performed after an open call by applying the flock primitive,

flock(fd, how);
int fd, how;

where the how parameter is formed from bits defined in <sys/file.h>:

#define LOCK_SH     1   /* shared lock */
#define LOCK_EX     2   /* exclusive lock */
#define LOCK_NB     4   /* don't block when locking */
#define LOCK_UN     8   /* unlock */

Successive lock calls may be used to increase or decrease the level of locking. If an object is
currently locked by another process when a flock call is made, the caller will be blocked until the
current lock owner releases the lock; this may be avoided by including LOCK_NB in the how
parameter. Specifying LOCK_UN removes all locks associated with the descriptor. Advisory
locks held by a process are automatically deleted when the process terminates.
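
The use of LOCK_NB can be sketched as follows, assuming flock is declared in <sys/file.h> as
on current BSD-derived and Linux systems. The helper names try_lock_exclusive and release_lock
are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Try to take an exclusive advisory lock through fd without blocking.
 * Returns 1 if the lock was obtained, 0 if a conflicting lock is held
 * through another open of the file, and -1 on any other error.
 */
int try_lock_exclusive(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
        return 1;
    return errno == EWOULDBLOCK ? 0 : -1;
}

/* Release whatever lock is held through fd. */
int release_lock(int fd)
{
    return flock(fd, LOCK_UN);
}
```

Only processes that make these calls are affected; a process that ignores the locks can still
read and write the file.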

2.2.9. Disk quotas

As an optional facility, each file system may be requested to impose limits on a user's disk
usage. Two quantities are limited: the total amount of disk space which a user may allocate in a
file system and the total number of files a user may create in a file system. Quotas are expressed
as hard limits and soft limits. A hard limit is always imposed; if a user would exceed a hard limit,
the operation which caused the resource request will fail. A soft limit results in the user receiving
a warning message, but with allocation succeeding. Facilities are provided to turn soft limits into
hard limits if a user has exceeded a soft limit for an unreasonable period of time.

To enable disk quotas on a file system the setquota call is used:

setquota(special, file)
char *special, *file;

where special refers to a structured device file where a mounted file system exists, and file refers
to a disk quota file (residing on the file system associated with special) from which user quotas
should be obtained. The format of the disk quota file is implementation dependent.

To manipulate disk quotas the quota call is provided:

#include <sys/quota.h>

quota(cmd, uid, arg, addr)
int cmd, uid, arg; caddr_t addr;

The indicated cmd is applied to the user ID uid. The parameters arg and addr are command
specific. The file <sys/quota.h> contains definitions pertinent to the use of this call.


2.3. Interprocess communications

2.3.1. Interprocess communication primitives

2.3.1.1. Communication domains

The system provides access to an extensible set of communication domains. A
communication domain is identified by a manifest constant defined in the file <sys/socket.h>.
Important standard domains supported by the system are the "unix" domain, AF_UNIX, for
communication within the system, and the "internet" domain for communication in the DARPA
internet, AF_INET. Other domains can be added to the system.

2.3.1.2. Socket types and protocols

Within a domain, communication takes place between communication endpoints known as
sockets. Each socket has the potential to exchange information with other sockets within the
domain.

Each socket has an associated abstract type, which describes the semantics of
communication using that socket. Properties such as reliability, ordering, and prevention of
duplication of messages are determined by the type. The basic set of socket types is defined in
<sys/socket.h>:

/* Standard socket types */
#define SOCK_DGRAM      1   /* datagram */
#define SOCK_STREAM     2   /* virtual circuit */
#define SOCK_RAW        3   /* raw socket */
#define SOCK_RDM        4   /* reliably-delivered message */
#define SOCK_SEQPACKET  5   /* sequenced packets */

The SOCK_DGRAM type models the semantics of datagrams in network communication:
messages may be lost or duplicated and may arrive out-of-order. The SOCK_RDM type models the
semantics of reliable datagrams: messages arrive unduplicated and in-order, the sender is notified
if messages are lost. The send and receive operations (described below) generate
reliable/unreliable datagrams. The SOCK_STREAM type models connection-based virtual
circuits: two-way byte streams with no record boundaries. The SOCK_SEQPACKET type models a
connection-based, full-duplex, reliable, sequenced packet exchange; the sender is notified if
messages are lost, and messages are never duplicated or presented out-of-order. Users of the last
two abstractions may use the facilities for out-of-band transmission to send out-of-band data.

SOCK_RAW is used for unprocessed access to internal network layers and interfaces; it has
no specific semantics.

Other socket types can be defined.†


† 4.2BSD does not support the SOCK_RDM and SOCK_SEQPACKET types.

Here msg_name and msg_namelen specify the source or destination address if the socket is
unconnected; msg_name may be given as a null pointer if no names are desired or required. The
msg_iov and msg_iovlen describe the scatter/gather locations, as described in section 2.1.3.
Access rights to be sent along with the message are specified in msg_accrights, which has length
msg_accrightslen. In the "unix" domain these are an array of integer descriptors, taken from the
sending process and duplicated in the receiver.

This structure is used in the operations sendmsg and recvmsg:

sendmsg(s, msg, flags);
int s; struct msghdr *msg; int flags;

msglen = recvmsg(s, msg, flags);
result int msglen; int s; result struct msghdr *msg; int flags;


2.3.1.8. Using read and write with sockets

The normal UNIX read and write calls may be applied to connected sockets; they are translated
into send and receive calls from or to a single area of memory, discarding any rights received.
A process may operate on a virtual circuit socket, a terminal or a file with blocking or
non-blocking input/output operations without distinguishing the descriptor type.

2.3.1.9. Shutting down halves of full-duplex connections

A  process  that  has  a  full-duplex  socket  such  as  a  virtual  circuit  and  no  longer  wishes  to  read 
from  or  write  to  this  socket  can  give  the  call: 

shutdown(s, direction);
int  s,  direction; 

where direction is 0 to not read further, 1 to not write further, or 2 to completely shut the
connection down.
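
A minimal sketch of a half-close, assuming a POSIX system and a UNIX-domain stream pair created with socketpair: one end writes a short payload with the ordinary write call, shuts down its sending half with direction 1, and the other end then reads the data followed by end-of-file. The helper name send_then_half_close is hypothetical, and the payload is assumed small enough to fit in the socket buffer.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes on one end of a stream pair, shut down the sending
 * half, and drain the other end. Returns the number of bytes the reader
 * saw before end-of-file, or -1 on error. (Hypothetical helper.) */
ssize_t send_then_half_close(const char *data, size_t len)
{
    int sv[2];
    char buf[64];
    ssize_t n, total = 0;

    assert(data != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    if (write(sv[0], data, len) != (ssize_t)len ||
        shutdown(sv[0], 1) < 0) {        /* 1: will not write further */
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* The reader sees the data, then 0 (end-of-file) once it is drained. */
    while ((n = read(sv[1], buf, sizeof buf)) > 0)
        total += n;

    close(sv[0]);
    close(sv[1]);
    return n < 0 ? -1 : total;
}
```

The read loop terminates only because the writer shut down its half; without the shutdown call the final read would block waiting for more data.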




2.3.1.10.  Socket  and  protocol  options 

Sockets, and their underlying communication protocols, may support options. These options
may be used to manipulate implementation specific or non-standard facilities. The getsockopt and
setsockopt calls are used to control options:

getsockopt(s, level, optname, optval, optlen)
int s, level, optname; result caddr_t optval; result int *optlen;

setsockopt(s, level, optname, optval, optlen)
int s, level, optname; caddr_t optval; int optlen;

The option optname is interpreted at the indicated protocol level for socket s. If a value is
specified with optval and optlen, it is interpreted by the software operating at the specified level.
The  level  SOL_SOCKET  is  reserved  to  indicate  options  maintained  by  the  socket  facilities. 
Other  level  values  indicate  a  particular  protocol  which  is  to  act  on  the  option  request;  these  values 
are  normally  interpreted  as  a  “protocol  number”. 
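
The following sketch sets one socket-level option and reads it back. It assumes the present-day POSIX signatures (void pointers and socklen_t rather than caddr_t and int) and uses SO_KEEPALIVE purely as an example option; the UNIX domain is chosen only so that the sketch needs no network configuration. The helper name keepalive_roundtrip is hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Enable SO_KEEPALIVE at level SOL_SOCKET and read the option back.
 * Returns 1 if it reads back as enabled, 0 if not, -1 on error. */
int keepalive_roundtrip(void)
{
    int s, on = 1, value = 0, ok;
    socklen_t len = sizeof value;

    s = socket(AF_UNIX, SOCK_STREAM, 0);
    if (s < 0)
        return -1;

    ok = setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) == 0 &&
         getsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &value, &len) == 0;
    close(s);
    if (!ok)
        return -1;

    assert(len == sizeof value);   /* the option value is a single int */
    return value != 0;             /* some systems return the flag bit, not 1 */
}
```

An option addressed to a particular protocol would use that protocol's number as the level in place of SOL_SOCKET.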


2.3.2. UNIX domain

This  section  describes  briefly  the  properties  of  the  UNIX  communications  domain. 

2.3.2.1. Types of sockets

In the UNIX domain, the SOCK_STREAM abstraction provides pipe-like facilities, while
SOCK_DGRAM provides (usually) reliable message-style communications.
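
The difference between the two abstractions can be sketched as follows, assuming a POSIX system: two messages are sent on a UNIX-domain socket pair and the size of the first receive is reported. With SOCK_DGRAM the first receive returns exactly the first message; with SOCK_STREAM the bytes form a pipe-like stream, so the first receive may return bytes from both. The helper name first_read_size is hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send "one" and then "three" on a UNIX-domain pair of the given type
 * and return the size of the first receive, or -1 on error. */
ssize_t first_read_size(int type)
{
    int sv[2];
    char buf[64];
    ssize_t n = -1;

    assert(type == SOCK_STREAM || type == SOCK_DGRAM);
    if (socketpair(AF_UNIX, type, 0, sv) < 0)
        return -1;

    if (send(sv[0], "one", 3, 0) == 3 && send(sv[0], "three", 5, 0) == 5)
        n = recv(sv[1], buf, sizeof buf, 0);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

For SOCK_DGRAM the result is 3, because message boundaries are preserved; for SOCK_STREAM the result is somewhere between 1 and 8, depending on how the system delivers the byte stream.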
2.4.1.1.1. Input modes

A terminal is in one of three possible modes: raw, cbreak, or cooked. In raw mode all input
is passed through to the reading process immediately and without interpretation. In cbreak mode,
the handler interprets input only by looking for characters that cause interrupts or output flow
control; all other characters are made available as in raw mode. In cooked mode, input is
processed to provide standard line-oriented local editing functions, and input is presented on a
line-by-line basis.

2.4.1.1.2. Interrupt characters

Interrupt characters are interpreted by the terminal handler only in cbreak and cooked
modes, and cause a software interrupt to be sent to all processes in the process group associated
with the terminal. Interrupt characters exist to send SIGINT and SIGQUIT signals, and to stop a
process group with the SIGTSTP signal either immediately, or when all input up to the stop
character has been read.

2.4.1.1.3. Line editing

When the terminal is in cooked mode, editing of an input line is performed. Editing
facilities allow deletion of the previous character or word, or deletion of the current input line.
In addition, a special character may be used to reprint the current input line after some number
of editing operations have been applied.

Certain other characters are interpreted specially when a process is in cooked mode. The
end of line character determines the end of an input record. The end of file character simulates
an end of file occurrence on terminal input. Flow control is provided by stop output and start
output control characters. Output may be flushed with the flush output character; and a literal
character may be used to force literal input of the immediately following character in the input line.


2.4.1.2. Terminal output

On output, the terminal handler provides some simple formatting services. These include
converting the carriage return character to the two character return-linefeed sequence, displaying
non-graphic ASCII characters as “^character”, inserting delays after certain standard control
characters, expanding tabs, and providing translations for upper-case only terminals.
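
Two of these services, tab expansion and the “^character” display of non-graphic characters, can be sketched as a small filter. This is an illustrative sketch only, not the handler's code: the function name format_output, the 8-column tab stops, and the choice to expand a newline into return-linefeed are all assumptions, and delays and upper-case translation are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Copy in[] to out[], expanding tabs to 8-column stops, turning newline
 * into return-linefeed, and showing other control characters as ^X.
 * Writes at most cap-1 bytes plus a terminating NUL and returns the
 * number of bytes written (excluding the NUL). */
size_t format_output(const char *in, char *out, size_t cap)
{
    size_t n = 0, col = 0;

    assert(in != NULL && out != NULL && cap > 0);
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;

        if (c == '\t') {
            size_t stop = (col / 8 + 1) * 8;   /* next tab stop */
            if (n + (stop - col) >= cap)
                break;
            while (col < stop) {
                out[n++] = ' ';
                col++;
            }
        } else if (c == '\n') {
            if (n + 2 >= cap)
                break;
            out[n++] = '\r';
            out[n++] = '\n';
            col = 0;
        } else if (c < 0x20 || c == 0x7f) {
            if (n + 2 >= cap)
                break;
            out[n++] = '^';
            out[n++] = (char)(c ^ 0x40);       /* 0x01 -> 'A', 0x7f -> '?' */
            col += 2;
        } else {
            if (n + 1 >= cap)
                break;
            out[n++] = (char)c;
            col++;
        }
    }
    out[n] = '\0';
    return n;
}
```

For the input "a", tab, "b", control-A, the filter produces "a", seven spaces, "b^A": eleven characters in all.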

2.4.1.3. Terminal control operations

When a terminal is first opened it is initialised to a standard state and configured with a set
of standard control, editing, and interrupt characters. A process may alter this configuration with
certain control operations, specifying parameters in a standard structure:

struct ttymode {
        short   tt_ispeed;      /* input speed */
        int     tt_iflags;      /* input flags */
        short   tt_ospeed;      /* output speed */
        int     tt_oflags;      /* output flags */
};

and “special characters” are specified with the ttychars structure,
II. Human/Machine Interaction and Expert Database Systems

3. Research in Human/Machine Interaction
3.1. Computer Graphics

Many applications of computer-aided geometric design require the description of objects
using mathematical functions called splines. A spline curve is a piecewise univariate function that
satisfies a set of continuity constraints where the curve segments meet. The point at which two
segments join is called a joint. A popular type of spline is the polynomial spline, defined by a set
of control vertices and a set of polynomial functions called basis functions that are used to blend,
or weight, the vertices.

Splines are either interpolating or approximating. Interpolating splines are required to pass
through the control vertices, while approximating splines are only required to pass “near” the
vertices. Splines can be further classified as either global or local representations. In a global
representation, the movement of a control vertex causes the entire spline to change. In a local
representation, it is possible to localise the change resulting from the perturbation of a control
vertex; this is the property of local control. Barsky’s development of the Beta-spline[1] has shown
that it is possible to introduce shape parameters into the curve formulation, which can be used to
modify the shape of the curve independent of the control vertices. Experience has shown that
shape parameters provide a designer with intuitive control of shape.

From  the  standpoint  of  computer-aided  geometric  design,  it  is  desirable  to  construct  local, 
polynomial  splines  with  shape  parameters.  Since  the  choice  of  interpolation  versus  approximation 
is  application  dependent,  both  should  be  possible.  By  combining  the  work  of  Catmull  and  Rom[2] 
with  that  of  Barsky,  a  class  of  splines  can  be  developed  possessing  shape  parameters  that  are 
local,  polynomial,  and  either  interpolating  or  approximating. 

Catmull and Rom introduced a class of local polynomial splines which could be made to
either interpolate or approximate a set of control vertices.¹ To construct a class of splines with the
properties enumerated above, we need only introduce shape parameters into the Catmull-Rom
splines. As with Beta-splines, this is done by replacing algebraic continuity with geometric
continuity.
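
For orientation, the sketch below evaluates one segment of the familiar uniform cubic interpolating member of the Catmull-Rom class. It has no shape parameters and is not the new class of splines described in this section; the type and function names are hypothetical. It does show the two properties the text relies on: the segment passes through its two middle control vertices, and it depends on only four vertices, which gives local control.

```c
#include <assert.h>

struct point { double x, y; };

/* One coordinate of the uniform cubic Catmull-Rom segment that runs
 * from p1 (t = 0) to p2 (t = 1), with p0 and p3 as neighbours. */
static double cr_coord(double p0, double p1, double p2, double p3, double t)
{
    double t2 = t * t, t3 = t2 * t;

    return 0.5 * (2.0 * p1
                  + (-p0 + p2) * t
                  + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t2
                  + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t3);
}

/* Evaluate the segment defined by four consecutive control vertices. */
struct point catmull_rom(const struct point v[4], double t)
{
    struct point q;

    assert(t >= 0.0 && t <= 1.0);
    q.x = cr_coord(v[0].x, v[1].x, v[2].x, v[3].x, t);
    q.y = cr_coord(v[0].y, v[1].y, v[2].y, v[3].y, t);
    return q;
}
```

Evaluating at t = 0 returns the second vertex and at t = 1 the third, so consecutive segments join at control vertices.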

Algebraic continuity refers to the continuity of parametric derivative vectors of the curve. A
continuous first derivative vector gives first order algebraic, or C1, continuity. If both the first
and second derivative vectors are continuous, the spline has second order algebraic (C2)
continuity. Geometric continuity, on the other hand, requires continuity of visual quantities such
as unit tangent and curvature vectors. A continuous unit tangent vector gives first order geometric
(G1) continuity, while second order geometric (G2) continuity refers to continuous unit tangent
and curvature vectors.

It had previously been shown that C1 continuity may be replaced with G1, and C2 may be
replaced with G2, while still maintaining visual smoothness. Since geometric continuity is less
restrictive than the corresponding order of algebraic continuity, the relaxation from algebraic to
geometric continuity allows the introduction of new degrees of freedom called shape parameters.
The replacement of C1 with G1 results in one shape parameter; replacing C2 with G2 results in
two shape parameters.
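
The count of shape parameters can be seen from the constraints at a joint. The following is a sketch in the usual Beta-spline notation, which is an assumption here since the report does not give the formulas: $Q_i$ and $Q_{i+1}$ are adjacent segments parametrised on $[0,1]$, and $\beta_1 > 0$ and $\beta_2$ are the shape parameters.

```latex
% Algebraic continuity at the joint between Q_i and Q_{i+1}
\begin{align*}
C^1:&\quad Q_{i+1}'(0) = Q_i'(1) \\
C^2:&\quad Q_{i+1}''(0) = Q_i''(1) \quad\text{(in addition to } C^1\text{)}
\end{align*}

% Geometric continuity: unit tangent and curvature vectors agree
\begin{align*}
G^1:&\quad Q_{i+1}'(0) = \beta_1\, Q_i'(1), \qquad \beta_1 > 0 \\
G^2:&\quad Q_{i+1}''(0) = \beta_1^{2}\, Q_i''(1) + \beta_2\, Q_i'(1)
      \quad\text{(in addition to } G^1\text{)}
\end{align*}
```

Setting $\beta_1 = 1$ and $\beta_2 = 0$ recovers the algebraic constraints, so G1 contributes one free parameter and G2 a second.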

We have shown[3] how the relaxation to geometric continuity can yield a class of Catmull-Rom
splines, either interpolating or approximating, whose shape can be modified via shape
parameters. The interpolating splines we present are new due to their shape parameters; they are
the first local, polynomial, interpolating splines with locally variable shape parameters.
Consequently, local modification of a shape parameter affects only a portion of the curve near the
corresponding joint.

¹ Unfortunately, the title of their paper did not reflect the fact that both approximating and
interpolating splines are members of the class.

3.2. Intelligent Systems

3.2.1. The UNIX Consultant

We have been engaged in the construction of a system called UC, for UNIX Consultant. UC
is designed to be an automated consultant that converses in natural language with naive users to
help answer their questions about the UNIX operating system. The intent is to provide a system
that allows a naive user to ask questions about terminology, about command names and formats,
for information concerning plans for doing things in UNIX, and for assistance in debugging
problems with UNIX commands. UC should respond by explaining terminology, providing command
names and describing their format, filling in the details of a user plan, suggesting plans to achieve
goals, or engaging a user in a dialogue by requesting more information.

Achieving this goal requires research into basic issues in natural language processing and
common sense reasoning. Our research views the user as a planning agent who has goals
implicitly or explicitly expressed by his or her utterance. It is UC’s task to determine the user’s
goals and aid the user by providing information for the achievement of those goals.

In our previous work, we had implemented a prototype version of this system. Much of our
research work has addressed shortcomings with the technology that limited the success of our
initial efforts.

Probably the most fundamental problem is one of knowledge representation. Weaknesses in
the knowledge representation scheme we were employing, and, indeed, in knowledge
representation schemes in general, prevented our system from having the flexibility and
extensibility we sought. To rectify this problem, we developed a new knowledge representation
scheme called KODIAK. KODIAK is described in Wilensky[4]. Some important characteristics of
KODIAK are: it clarifies the semantics of slots; it is uniform, and applies to any number of
semantic domains; it is based on relations, and in particular, has a small set of primitive
epistemological relations and allows for the creation and definition of new relations; and it has a
canonical form.

Perhaps the most interesting aspect of KODIAK is the introduction of “non-truth
conditional” representation entities. We call such entities views. The idea behind views is that one
concept can be thought of in terms of another. This viewing of another concept creates a new
concept. It appears that the introduction of such entities solves a number of long standing
representational issues.

For example, one problem is that most systems provide one category called Person, and
another called Physical-Object, and presume that these are in fact distinct. This is a problem
because in many cases, we want people to inherit properties of physical objects, even though they
are not properly classified as such. We address this issue in KODIAK by asserting that Person is
a kind of Living-Thing, but, in addition, we also assert that it is possible to VIEW a Person
as a Physical-Object. Moreover, the VIEW of Person as a Physical-Object is itself another
concept, namely the concept Body. Thus we can allow Person to share some of the
properties of Physical-Object without it properly being represented as one. The importance of
views is that they allow the flexibility of viewing a (possibly defined) concept as something other
than its “ordinary” interpretation. We believe that views constitute a major finding in the area
of knowledge representation.

Views appear to be particularly important in representing knowledge about language.
Jacobs[5] points out that many otherwise unstatable linguistic regularities can be captured using
views. For example, there are many cases like “John took a punch from Bill”, “Bill gave John a
punch”, and “John gave Mary a kiss” in which it appears that a person being acted upon can be
talked about in terms of giving and taking. Our solution is to represent a “being acted upon” as
a transfer view, for the purposes of linguistic expression. That is, an object being acted upon can
be viewed as transferring the action to that object. Once this view is represented, knowledge
about the linguistic expression of transferring (e.g., that certain transfers can be expressed using
the words “give” and “take”) can be employed to express certain kinds of non-transferring
actions that act upon some object.
To delete the file named dave, type rm dave.

User:  How  can  I  add  read  permission  to  a  file? 

To  add  read  permission  to  a  file,  use  chmod. 

For example, to add individual read permission to the file named
charlie, type chmod u+r charlie.

User: How can I add the owner execute permission to a file?

To  add  individual  execute  permission  to  a  file,  use  chmod. 

For example, to add individual execute permission to the file
named marc, type chmod u+x marc.

User:  How  can  I  change  the  group  execute  permission  of  a  file? 

To  change  the  group  execute  permission  of  a  file,  use  chmod. 

For  example,  to  remove  group  execute  permission  from  the  file 
named  jim,  type  chmod  g-x  jim. 

User: How can I change the group read permission of a file?

To change the group read permission of a file, use chmod.

For  example,  to  remove  group  read  permission  from  the  file  named 
george,  type  chmod  g-r  george. 

User:  How  can  I  remove  the  group  write  permission  from  the  file  named  bimbo? 

To  remove  group  write  permission  from  the  file  named  bimbo,  type 
chmod  g-w  bimbo. 

User:  How  do  I  change  the  protection  of  a  file? 

To  change  the  permission  of  a  file,  use  chmod. 

For  example,  to  remove  individual  read  permission  from  the  file 
named  paul,  type  chmod  u-r  paul. 

User:  How  can  I  compact  a  file? 

To  compact  a  file,  use  compact. 

For  example,  to  compact  the  file  named  peter,  type  compact  peter. 

User: How can I create a directory?

To create a directory, use mkdir.

For example, to create the directory named paul, type mkdir paul.

User:  How  can  I  find  a  file? 

To  find  a  file,  use  whereis. 

For  example,  to  find  the  file  named  charlie,  type  whereis  charlie. 

User:  How  can  I  find  out  how  much  disk  space  I  am  using? 

To  find  out  how  much  disk  space  you  are  using,  type  du. 

User:  How  can  I  find  out  who  is  using  the  system? 

To  find  out  all  the  users  on  the  system,  type  finger. 

User: How can I find out wilensky’s phone number?

To find out Wilensky’s phone number, type finger wilensky.

We have also been developing additional kinds of planning mechanisms to support the reasoning
component of our system. In particular, Richard Alterman has been developing a technique called
adaptive planning[6].

The aim of adaptive planning is to borrow plans from related situations and refit them to meet
the demands of the current planning situation. For example, suppose UC has a plan for printing
a file on the imagen using the UNIX ‘lpr’ command, by appending the argument ‘-Pip’. An
adaptive planner can re-use this plan in order to construct a plan for deleting a file from the imagen
queue using the ‘lprm’ command by appending the argument ‘-Pip’.

Alterman suggests several distinguishing features of adaptive planning. The first of these is
that adaptive planning makes the background knowledge associated with an old plan explicit. By
making the content and organisation of the background knowledge explicit, it becomes possible to
re-interpret the plan for a wider variety of situations. A second important feature of adaptive
planning is that it plans by situation matching. Rather than treating the old plan as a partial
solution which is modified using weak (problem solving) methods, the old plan is used as a
starting point from which the old and new situations are matched. In the course of the matching a
new plan is produced. A third important feature of adaptive planning is that it can re-use old plans
in circumstances where the planner does not have access to a general plan. As shown in the example
above, adaptive planning is capable of taking a specific plan for accomplishing a specific set of
goals, and refitting it to meet the demands of some new situation.

A  prototype  model  of  an  adaptive  planner  is  now  under  construction. 


3.2.2. Syllogistic Reasoning in Fuzzy Logic

Fuzzy logic may be viewed as a generalization of multivalued logic in that it provides a
wider range of tools for dealing with uncertainty and imprecision in knowledge representation,
inference and decision analysis. In particular, fuzzy logic allows the use of (a) fuzzy predicates
exemplified by small, young, nice, etc.; (b) fuzzy quantifiers exemplified by most, several, many,
few, many more, etc.; (c) fuzzy truth-values exemplified by quite true, very true, mostly false, etc.;
(d) fuzzy probabilities exemplified by likely, unlikely, not very likely, etc.; (e) fuzzy possibilities
exemplified by quite possible, almost impossible, etc.; and (f) predicate modifiers exemplified by
very, more or less, quite, extremely, etc.

What matters most about fuzzy logic is its ability to deal with fuzzy quantifiers as fuzzy
numbers which may be manipulated through the use of fuzzy arithmetic[7]. This ability depends
in an essential way on the existence, within fuzzy logic, of the concept of cardinality or, more
generally, the concept of measure of a fuzzy set. Thus, if one accepts the classical view of
Kolmogoroff that probability theory is a branch of measure theory, then, more generally, the theory
of fuzzy probabilities may be subsumed within fuzzy logic. This aspect of fuzzy logic makes it
particularly well-suited for the management of uncertainty in expert systems[8]. More specifically,
by employing a single framework for the analysis of both probabilistic and possibilistic
uncertainties, fuzzy logic provides a systematic basis for inference from premises which are
imprecise, incomplete or not totally reliable. In this way, it becomes possible, as we have shown,
to derive a set of rules for combining evidence through conjunction, disjunction and chaining. In
effect, such rules may be viewed as instances of syllogistic reasoning in fuzzy logic; however,
unlike the rules employed in most of the existing expert systems, they are not ad hoc in nature.

A fuzzy syllogism in fuzzy logic is defined to be an inference schema in which the major
premise, the minor premise and the conclusion are propositions containing fuzzy quantifiers. A
basic fuzzy syllogism in fuzzy logic is the intersection/product syllogism

    $Q_1$ A's are B's
    $Q_2$ (A and B)'s are C's
    ------------------------------------------
    $(Q_1 \otimes Q_2)$ A's are (B and C)'s,

in which A, B and C are fuzzy predicates (e.g., young men, blonde women, etc.); $Q_1$ and $Q_2$ are
fuzzy quantifiers (e.g., most, many, almost all, etc.) which are interpreted as fuzzy numbers; and
$Q_1 \otimes Q_2$ is the product of $Q_1$ and $Q_2$ in fuzzy arithmetic.
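
One simple way to make the product of two fuzzy quantifiers concrete is to represent each as a trapezoidal fuzzy number on [0, 1] and multiply corresponding corner points. This is a sketch under stated assumptions, not the method of the cited work: the trapezoidal representation and the names struct fuzzy and fuzzy_product are our own, and although the support and core of the product are exact for non-negative operands, the sloping sides of the true product are only approximated by straight lines.

```c
#include <assert.h>

/* Trapezoidal fuzzy number: membership is 0 outside [a, d], 1 on [b, c],
 * and linear in between. (Hypothetical representation.) */
struct fuzzy { double a, b, c, d; };

/* Approximate product of two non-negative trapezoidal fuzzy numbers,
 * obtained by multiplying corresponding corner points. */
struct fuzzy fuzzy_product(struct fuzzy p, struct fuzzy q)
{
    struct fuzzy r;

    assert(0.0 <= p.a && p.a <= p.b && p.b <= p.c && p.c <= p.d);
    assert(0.0 <= q.a && q.a <= q.b && q.b <= q.c && q.c <= q.d);
    r.a = p.a * q.a;   /* support of the product: [a1*a2, d1*d2] */
    r.b = p.b * q.b;   /* core of the product:    [b1*b2, c1*c2] */
    r.c = p.c * q.c;
    r.d = p.d * q.d;
    return r;
}
```

If “most” is modelled, purely for illustration, as (0.5, 0.75, 1, 1), then the product of “most” with itself is (0.25, 0.5625, 1, 1): a weaker quantifier than “most”, as one expects when two uncertain premises are chained.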
We have developed several other basic syllogisms which may be employed as rules of
combination of evidence in expert systems. Among these is the consequent conjunction syllogism,
which may be expressed as the inference schema

    $Q_1$ A's are B's
    $Q_2$ A's are C's
    ------------------------------------------
    $Q$ A's are (B and C)'s,

in which $Q$ is a fuzzy number bounded from above by $Q_1 \wedge Q_2$ and from below by
$0 \vee (Q_1 \oplus Q_2 \ominus 1)$, where $\oplus$, $\ominus$ and $\wedge$ are the extensions of
the operators $+$, $-$ and min, respectively, to fuzzy operands. Furthermore, we have shown that
syllogistic reasoning in fuzzy logic provides a basis for reasoning with dispositions, that is, with
propositions which are preponderantly, but not necessarily always, true.
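
In the special case where the quantifiers are ordinary (crisp) proportions, the bounds of the consequent conjunction syllogism reduce to simple arithmetic: if a fraction q1 of the A's are B's and a fraction q2 of the A's are C's, then the fraction of A's that are both lies between max(0, q1 + q2 - 1) and min(q1, q2). The sketch below computes this interval; reading the bounds this way, and the names used, are assumptions made for illustration, and the fuzzy case would apply the same operations to fuzzy numbers.

```c
#include <assert.h>

struct interval { double lo, hi; };

/* Bounds on the proportion of A's that are both B and C, given the
 * proportions q1 (A's that are B) and q2 (A's that are C). */
struct interval conjunction_bounds(double q1, double q2)
{
    struct interval r;

    assert(q1 >= 0.0 && q1 <= 1.0 && q2 >= 0.0 && q2 <= 1.0);
    r.lo = q1 + q2 - 1.0;
    if (r.lo < 0.0)
        r.lo = 0.0;                    /* max(0, q1 + q2 - 1) */
    r.hi = q1 < q2 ? q1 : q2;          /* min(q1, q2) */
    return r;
}
```

With q1 = 0.75 and q2 = 0.5 the proportion is bounded by 0.25 and 0.5; the conclusion is an interval rather than a single value because the premises do not say how B and C overlap.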

This work may be viewed as an initiation of a study of syllogistic reasoning in the context
of fuzzy logic. Such reasoning has a direct bearing on the rules of combination of evidence in
expert systems and, in addition, provides a basis for inference from commonsense knowledge by
viewing such knowledge as a collection of dispositions.


3.3. Software Development Systems


3.3.1. Relational Views of Programs

Making  a  change  to  a  software  system  requires  an  understanding  of  how  the  part  being 
changed  fits  into  the  system.  A  major  part  of  understanding  is  simply  seeing  the  information 
relevant  to  what  one  is  trying  to  understand. 

The OMEGA programming system[9] provides mechanisms for seeing and manipulating
software in a much more powerful and general way than current systems. Instead of a linear
view, such as presented by UNIX, or a hierarchical view, such as presented by Gandalf[10],
OMEGA provides multiple relational views of the information in a program.

The  relational  model  provides  very  powerful  operations  for  describing  portions  of  a  database 
of  information.  OMEGA  gives  programmers  the  opportunity  to  view  and  change  a  wide  variety 
of  cross-sections  of  a  software  system. 

We have begun building a prototype implementation of OMEGA using the relational
database system of INGRES[11]. So far, we have implemented a program to extract and store the
information in traditional program text into an INGRES database, and a simple
pointing-oriented user interface for browsing programs in the database.

Configurations, versions, call graphs, and slices are all examples of views, or cross-sections of
programs. To provide a powerful mechanism for defining, retrieving, and updating these views,
the OMEGA programming system uses a relational database system to manage all program
information.
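
To suggest what a relational view of a program looks like, the sketch below stores a call graph as a relation of (caller, callee) tuples and answers the query "which procedures call a given procedure?" by selection. The schema and names are hypothetical and are not taken from OMEGA; in OMEGA itself such a query would be posed to the database system rather than coded by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One tuple of a hypothetical "calls" relation. */
struct call {
    const char *caller;
    const char *callee;
};

/* Select the callers of `callee` from the relation. Stores up to max
 * caller names in out[] and returns how many tuples matched. */
size_t callers_of(const struct call *rel, size_t n, const char *callee,
                  const char **out, size_t max)
{
    size_t i, found = 0;

    assert(rel != NULL || n == 0);
    assert(callee != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(rel[i].callee, callee) == 0) {
            if (found < max)
                out[found] = rel[i].caller;
            found++;
        }
    }
    return found;
}
```

The same relation supports other cross-sections without restructuring the data: selecting on the caller field instead yields the procedures that a given procedure calls.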

We have built a prototype implementation of the OMEGA-database system interface.
This implementation includes the design of a relational schema for a Pascal-like language, a
program for taking software stored as text and translating it into the database representation, and
a simple pointing-oriented user interface. Initial performance measurements indicate that response
is too slow in our current environment, but that advances in database software technology and
hardware should make response fast enough in the near future.

Our initial measurements of performance show that compiled queries and buffering improve
performance significantly. In general, the database system should be able to use main memory
and more semantic information about the data to provide substantially better performance than is
currently available.


—  December  1985  — 


Final  Report 


Software Development Systems


3.3.2. A Graph Browser

The  many  uses  of  graphs  in  computer  science  provide  a  wide  application  base  for  a  general 
purpose  graph  browser.  An  early  prototype  of  such  a  graph  browser  has  shown  that 

•  graphs  should  be  presented  in  the  usual  fashion 

•  arcs  and  arc  labels  should  be  drawn 

•  convenient  browsing  and  editing  operations  are  required 

Taking these features into account, we have proposed a design for GRAB[12], a general
purpose graph browser. GRAB will allow users to browse graphs stored in a relational database. It
will use a user-friendly window-mouse paradigm. The combination of a graphical presentation of
a directed graph and the mouse as a pointing device will allow users to browse large graphs
conveniently.

The  database  schema  used  by  GRAB  allows  users  to  specify  a  map  from  the  user’s  database 
schema  so  that  the  interface  can  be  used  for  various  kinds  of  data  without  much  effort. 

Finally,  GRAB  will  draw  graphs  in  a  systematic  form  known  as  a  proper  hierarchy.  In  this 
form,  graphs  are  partitioned  into  levels,  with  arcs  directed  from  an  upper  level  to  the  next  lower 
level.  Long  arcs  which  traverse  more  than  one  level  are  shortened  with  the  addition  of  dummy 
nodes  at  each  intermediate  level.  Cycles  within  levels  are  collapsed  into  new  nodes  called  proxies. 
Heuristics are used to minimize the arc crossings in the graph and position the nodes on each
level. The net result is a reasonable-looking graph which clearly conveys the intended
entity-relationships.
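
The treatment of long arcs can be sketched as follows. Given a level for every node, each arc that spans more than one level is replaced by a chain of unit-length arcs through newly created dummy nodes, one per intermediate level. This is an illustrative sketch rather than GRAB's code: the data layout and the name split_long_arcs are assumptions, and every arc is assumed to point from an upper level to a strictly lower one.

```c
#include <assert.h>

struct arc { int from, to; };

/* Replace every arc spanning more than one level by a chain through
 * dummy nodes. level[] holds the level of nodes 0..*nnodes-1 and has
 * room for maxnodes entries; dummy nodes are appended to it. The
 * resulting arcs are written to out[]. Returns the number of arcs
 * written, or -1 if either array is too small. */
int split_long_arcs(const struct arc *arcs, int narcs,
                    int *level, int *nnodes, int maxnodes,
                    struct arc *out, int maxout)
{
    int i, l, nout = 0;

    for (i = 0; i < narcs; i++) {
        int prev = arcs[i].from;
        int to_level = level[arcs[i].to];

        assert(level[prev] < to_level);   /* arcs point downward */
        for (l = level[prev] + 1; l < to_level; l++) {
            if (*nnodes >= maxnodes || nout >= maxout)
                return -1;
            level[*nnodes] = l;           /* new dummy node on level l */
            out[nout].from = prev;
            out[nout].to = *nnodes;
            nout++;
            prev = (*nnodes)++;
        }
        if (nout >= maxout)
            return -1;
        out[nout].from = prev;            /* final hop reaches the target */
        out[nout].to = arcs[i].to;
        nout++;
    }
    return nout;
}
```

An arc from level 0 to level 3 therefore becomes three arcs through two dummy nodes, after which every arc joins adjacent levels and crossing minimization can proceed level by level.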

The proper hierarchy layout of graphs does have a few problems. The major problem is
that no provision is made for the expansion of the proxies in the graph. This problem could be
solved either by expanding proxy constituents around the circumference of a circle or by
precomputing a layout from proxies with a limited number of nodes. In addition, the introduction
of same-level arcs to the hierarchy can reduce the number of levels and dummy nodes in the graph.

In  its  current  implementation,  the  performance  of  the  graph  drawing  algorithm  deteriorates 
somewhat  for  large  graphs.  A  major  part  of  this  is  caused  by  a  lack  of  bookkeeping  in  the  arc 
minimization  heuristics. 

In order to improve the performance of GRAB on large graphs, some consideration will have
to be given to a precomputation and incremental update scheme. This would involve
precomputing the proper hierarchy and layout for a graph and updating this computation whenever the
graph  is  changed.  The  tradeoff  between  time  spent  running  the  graph  drawing  algorithm  anew 
and  that  spent  updating  an  old  computation  has  yet  to  be  examined. 

3.3.3. Machine Specific Code Improvements

The translation of a programming language onto a target architecture requires analyses of
both the language and the architecture. Much of the analysis of programming languages is now
formalized and incorporated in compiler construction tools for the “front-ends” of compilers.
Analysis of a target architecture by construction tools for the “back-ends” of compilers is of at
least two kinds. It is sufficient to discover an implementation of each language construct on the
target architecture[13]. Additional analysis may discover features of the architecture that can be
exploited to generate more efficient code.

Often a target architecture contains general purpose instructions and, in addition, special
purpose instructions that perform the same operations for a restricted set of operands (for
example, addition versus increment). Such special purpose instructions are often faster or smaller
than the equivalent more general instructions. A code generator that avoids less efficient sequences
in favor of more efficient equivalent instructions produces better code. The analysis of what
restrictions must hold to use special purpose instructions is tedious and prone to error if done by
hand, and is susceptible to automation. Such analysis takes a suitable machine description and
discovers when sequences of general purpose instructions are equivalent to special purpose
instructions. One may think of the analysis as imposing a set of constraints on general purpose
instructions that make them equivalent to a special purpose instruction.
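
The addition-versus-increment example can be written out as a small substitution rule. The sketch below uses a hypothetical four-instruction machine and a hand-written rule table of the kind that the analysis is meant to derive automatically; none of the names come from the tool described here. Each rule pairs a general purpose instruction with the constraint on its operand under which a special purpose instruction is equivalent.

```c
#include <assert.h>

enum opcode { OP_ADD, OP_SUB, OP_INC, OP_DEC };

/* A register/immediate instruction on a hypothetical machine:
 * ADD r, imm and SUB r, imm are general; INC r and DEC r are special. */
struct insn {
    enum opcode op;
    int reg;
    long imm;      /* ignored by INC and DEC */
};

/* Replace a general purpose instruction by the equivalent special
 * purpose one when the constraint on its immediate operand holds. */
struct insn specialize(struct insn in)
{
    assert(in.reg >= 0);
    if ((in.op == OP_ADD && in.imm == 1) ||
        (in.op == OP_SUB && in.imm == -1)) {
        in.op = OP_INC;        /* constraint: net effect is r = r + 1 */
        in.imm = 0;
    } else if ((in.op == OP_SUB && in.imm == 1) ||
               (in.op == OP_ADD && in.imm == -1)) {
        in.op = OP_DEC;        /* constraint: net effect is r = r - 1 */
        in.imm = 0;
    }
    return in;                 /* otherwise left unchanged */
}
```

Writing such rules by hand for every instruction and operand restriction is the tedious, error-prone case analysis that the text proposes to automate.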

A compiler construction tool has been designed and built that automates much of the case
analysis necessary to exploit special purpose instructions on a target machine. Given a suitable
description of a target machine, the tool identifies instruction sequences that are equivalent
to single instructions. During code generation, these equivalences can be used to avoid inefficient
sequences in favor of more efficient instructions.
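
As an illustration only, the use of such equivalences during code generation can be sketched as a
table lookup. The sketch below is a minimal Python model, not the tool described in this report:
the opcode names, the operand constraints, and the function `select` are all hypothetical.

```python
# Hypothetical idiom table: (general opcode, constraint on operands, special opcode).
# The opcodes and constraints are illustrative, not taken from a real machine description.
IDIOMS = [
    ("add", lambda dst, src: src == 1, "inc"),
    ("sub", lambda dst, src: src == 1, "dec"),
    ("mov", lambda dst, src: src == 0, "clr"),
]


def select(opcode, dst, src):
    """Return the instruction to emit for `opcode dst, src`."""
    for general, constraint, special in IDIOMS:
        if opcode == general and constraint(dst, src):
            return (special, dst)      # constraint holds: emit the special purpose form
    return (opcode, dst, src)          # otherwise keep the general purpose form
```

Under these assumptions, `select("add", "r0", 1)` yields the special form `("inc", "r0")`, while
`select("add", "r0", 4)` is left as the general instruction.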

A working prototype of the instruction set analyser needed in the framework outlined by
[...] is presented by P. Kessler[15]. In contrast to the work presented in Davidson and
Fraser[16] [17], machine descriptions are analysed entirely during compiler construction (i.e. once
per machine), rather than during code generation (i.e. each time the compiler is used). R.
Kessler[16] describes such a system for discovering equivalent instructions for instruction
sequences of length 2. The techniques presented here can identify instruction sequences of
arbitrary length that are equivalent to single instructions.

This analysis has been applied to the descriptions of two machines, and the results have
been used[13] to replace hand-written case analysis routines in an otherwise table driven code
generator.

The basic idea of the analysis is to find constraints on instruction sequences so that they
perform the same computation as a single instruction on the target machine. Unlike most other
approaches in the literature, the sequences are found by decomposing single instructions to find
sequences that can be constrained to be equivalent.

Previous examinations of machine descriptions for special purpose instructions have composed
instruction sequences for analysis. Using composition, sequences of presumed inefficient
code are constructed and then the machine description is consulted to find a more efficient
implementation of that code.

Using the composition algorithms, sequences must be composed before the machine description
is examined. Thus, the number of sequences examined is an exponential function of the
number of target machine instructions, whose degree is the length of the sequences considered.
The composition algorithms are limited in practice to considering pairs of instructions. The
composition analysis thus finds only 1-to-1 and 2-to-1 equivalences.
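
To make this growth concrete, the following sketch counts the candidate sequences that a
composition-based analysis would have to form before consulting the machine description. The
function name is hypothetical, and the instruction count of 300 in the note below is an assumed
figure for illustration, not a measurement from this work.

```python
def sequences_to_examine(num_instructions, max_length):
    """Count all candidate sequences of length 1..max_length over the instruction set."""
    return sum(num_instructions ** k for k in range(1, max_length + 1))
```

With an assumed 300 instructions, `sequences_to_examine(300, 2)` is 90,300, whereas allowing
sequences of length 3 raises the count to 27,090,300, which suggests why composition is limited
to pairs in practice.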

Instruction sequences may be extended to arbitrary lengths in the attempt to decompose an
instruction. This is a major contribution of the decomposition technique. The complexity of the
analysis process is exponential in the number of instructions on the target machine, with the
degree of the exponential depending on the lengths of the sequences found to be equivalent. The
sequences do not grow very long, since most architectures do not include extremely complex
instructions (that can be decomposed by this algorithm). A performance improvement is achieved
by matching the tails of sequences only once. In addition, trial extensions that fail (due to
mismatches of the architecture or unsatisfiable constraints on the programs) are not extended
further. Thus, the number of sequences considered in practice is acceptably small.

Our  algorithm  works  from  the  end  of  the  instruction  towards  the  beginning.  A  more 
straightforward  technique  is  to  proceed  forward  through  the  instruction  descriptions,  using  code 
generation  techniques  to  discover  alternative  implementations.  Forward  decomposition  is  too 
greedy  to  find  certain  decompositions,  however. 
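
The direction of the search can be illustrated with a deliberately small model. The sketch below
is not the algorithm of this report: it assumes a hypothetical machine whose simple instructions
each compute one operator, describes a complex instruction as an expression tree, and works from
the root of the tree (the last step of the computation) back towards the leaves, abandoning a
trial as soon as some step has no matching instruction. The names `Expr`, `decompose`, and the
opcode table are illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical toy machine: one simple instruction per operator.
SINGLE_OP_INSTRUCTIONS = {"+": "add", "*": "mul", "<<": "shl"}


@dataclass(frozen=True)
class Expr:
    op: str        # operator computed at this node, e.g. "+"
    args: tuple    # operands: nested Expr nodes or leaf operand names (str)


def decompose(expr, counter=None):
    """Return a list of (opcode, dest, operands) computing expr, or None on failure."""
    counter = counter if counter is not None else [0]
    opcode = SINGLE_OP_INSTRUCTIONS.get(expr.op)
    if opcode is None:
        return None                        # failed trial: do not extend it further
    prefix, operands = [], []
    for arg in expr.args:
        if isinstance(arg, Expr):
            sub = decompose(arg, counter)  # earlier instructions must produce this operand
            if sub is None:
                return None
            prefix.extend(sub)
            operands.append(sub[-1][1])    # destination of the sub-sequence
        else:
            operands.append(arg)
    counter[0] += 1
    dest = f"t{counter[0]}"                # fresh temporary for this step's result
    return prefix + [(opcode, dest, tuple(operands))]
```

For a hypothetical multiply-add described as `Expr("+", (Expr("*", ("a", "b")), "c"))`, the sketch
finds a `mul` followed by an `add`; an operator with no simple instruction makes the attempt fail.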

The descriptions of the machine include the descriptions of the computations performed by
addressing modes. Thus decomposition may discover that computations implicit in operand
addressing may be used to replace explicit instructions. For example, a move-effective-address
instruction with a source operand that adds a constant displacement to a register can be used as
an alternate implementation for the addition of a constant, provided the other addend is a
register.


The decomposition algorithm discovers code sequences that perform equivalent computations.
The sequences are often not equivalent in cost (the difference in costs is the motivation for
identifying otherwise equivalent sequences!). Both code space and execution time must be
considered. Accurate costs for sequences cannot be compared during analysis at compiler
construction time, however. In part this is because instructions are analysed for correspondence
of operands, without necessarily restricting operands to particular addressing modes. Therefore,
the costs for operands cannot be accurately determined until a particular program is compiled.
As an extreme case, the VAX-11 has equivalent code sequences where the choice of which sequence
is best depends on the compile time value of an operand displacement. Thus, cost functions
cannot be analysed during compiler construction.
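
A minimal sketch of why the choice must wait until compile time is given below. The byte counts,
the two candidate encodings, and the function names are assumptions made up for illustration; they
do not describe actual VAX-11 encodings.

```python
def displacement_bytes(disp):
    """Bytes needed to encode a signed displacement (hypothetical encoding rule)."""
    if -128 <= disp <= 127:
        return 1
    if -32768 <= disp <= 32767:
        return 2
    return 4


# Two equivalent candidates, each with a size function of the displacement value.
CANDIDATES = [
    ("displacement form", lambda disp: 2 + displacement_bytes(disp)),
    ("fixed form", lambda disp: 5),
]


def cheapest(candidates, disp):
    """Pick the smallest equivalent sequence once the displacement value is known."""
    return min(candidates, key=lambda c: c[1](disp))[0]
```

Under these assumed sizes, a displacement of 4 favors the displacement form (3 bytes against 5),
while a displacement of 100000 favors the fixed form (5 bytes against 6).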

The prototype has been used to analyse two architectures, the VAX-11 and the M68000. The
analysis of the VAX-11 takes just over 2 VAX-11/750 cpu hours and discovers almost 1800 idioms.
The analysis of the M68000 takes just under 4 VAX-11/750 cpu hours and discovers over 600
idioms. The longest sequences discovered were of length 3 on both architectures.

The  analyser  exploits  several  properties  of  the  machine  descriptions  to  reduce  the  amount  of 
work  required  of  it.  For  example,  families  of  instructions  that  vary  only  in  the  type  of  their 
operands  can  often  be  analysed  only  once.  In  addition,  many  instructions  in  a  target  architecture 
can be shown to perform unique operations, and thus there is no need to decompose them or to
use  them  in  the  decomposition  of  other  instructions. 

The  addition  of  retargetable  code  improvers  to  the  suite  of  compiler  construction  tools 
improves the overall quality of the compilers. The uniform application of such tools provides a
standard of code generator quality, which makes it possible to compare machine architectures.
The availability of compilers that can exploit special purpose instructions frees machine architects
to  design  such  instructions  into  new  machines. 

3.3.4. Experiences with Code Generation for Ada

An efficient runtime representation of Ada programs was designed and implemented, using
the AT&T Bell Laboratories Ada Breadboard Compiler as a foundation[10]. For lack of time the
implementation did not include tasking, generic packages, or many of the implementation
dependent features, but did include most other features of the language. Our implementation was
concerned primarily with the compiler phase starting from the high level intermediate representation
DIANA and translating to the low-level IR of the portable C compiler, which is also the input
language for codegen[13].

Conclusions  will  be  presented  in  the  following  way.  First,  the  experiences  we  had  with 
DIANA  and  the  C  IR  are  summarized.  Then  the  implications  our  runtime  system  design  goals  had 
on  the  actual  BAC  implementation  are  discussed.  Next,  performance  measurements  of  the 
current  BAC  implementation  are  provided.  With  these  figures  we  give  reasons  why  performance 
data of the BAC and the ABC middle ends cannot be compared. Finally, the execution
performance of Berkeley Pascal and the BAC is compared.

Conclusions  about  DIANA 

Working with the DIANA representation was not a pleasant experience. The source
reproducibility requirement of the DIANA design caused much of the DIANA structure to be
unnormalized, hence more complex, larger, and less usable by the back end. In addition, DIANA is not
particularly  well  designed  for  use  by  either  the  front  end  or  the  back  end  because  important 
features  like  the  symbol  table  are  poorly  represented.  It  is  clear  that  the  DIANA  designers  con¬ 
sidered  the  environment  tools  the  most  important  users  of  DIANA,  and  gave  its  space  efficiency 
and  compiler  usability  less  consideration  than  they  deserved. 

Conclusions  about  the  IR 

Using  the  IR  was  a  good  decision.  Because  the  IR  provides  a  flexible  low  level  representa¬ 
tion  that  does  not  require  the  user  to  think  about  details  such  as  register  allocation,  it  is  con¬ 
venient  and  easy  to  use.  Perhaps  the  greatest  fault  in  the  IR  is  that  it  is  being  used  for  more  pur¬ 
poses  than  it  was  originally  intended.  The  difference  between  the  success  of  the  IR  and  the  failure 
of  DIANA  is  clear.  The  IR  is  a  form  that  was  intended  to  ease  retargeting  C  compilers  to  different 
architectures. It was not designed to be a low level representation for production quality
compilers of many languages. The reason that the IR is so successful is that its value has been
established through much experience with it. From its inception DIANA was designed to be a
standard for Ada intermediate representations. However, this was decided long before experience
with the DIANA representation had shown one way or the other whether it is a good representation.
The moral is that practice with any representation is the only way to determine its true value.

Conclusions  about  the  BAC  Runtime  System 

The runtime system described in [18] was designed to be an efficient representation of the
features  necessary  for  a  complete  Ada  runtime  environment.  Because  the  system  shares  as  much 
descriptor  information  as  possible,  it  does  not  provide  uniform  access  techniques  for  an  object’s 
descriptor.  This  non-uniformity  means  that  the  compiler  has  to  be  more  intelligent  about  the 
context  and  type  of  a  particular  object.  Thus,  our  runtime  system  attempts  to  achieve  efficient 
representation  at  the  cost  of  greater  compiler  complexity. 

There are some problems with this approach. The increased complexity causes the middle
end to be even larger than it already has to be to implement all the features of Ada. Our
implementation of the middle end is approximately 20,500 lines of C source code, including the
normalization phase and IR implementation. The Ada Breadboard Compiler middle end, which has C as
its target language, contained approximately 6200 lines of C.† Their effort was not intended to be
of production quality but the size difference is still significant. Since the middle end is the most
difficult part of an Ada compiler to retarget, an important part of its design should be simplicity.

Still,  there  are  clear  advantages  to  our  runtime  representation.  Sharing  descriptors  at  run¬ 
time  saves  considerable  stack  space,  and  saves  the  execution  time  spent  initialising  redundant 
descriptors. Using an optimizer capable of dead-variable elimination, many of the descriptors
that are present but not referenced will be eliminated altogether. With this representation, any
Ada type declaration which would be legal in Pascal (i.e. static arrays, and non-discriminated
records) would incur no runtime overhead not also present in the Pascal implementation. In this
representation, none of the overhead (type descriptors, thunks, jthunks, etc.) imposes a distributed
execution overhead on programs that do not use the complex features.

The  Performance  of  the  Berkeley  Ada  Compiler 

Some  useful  comparisons  can  be  made  between  the  Berkeley  Ada  Compiler  and  the  Ada 
Breadboard  Compiler.  Due  to  the  work  of  Murphy,  the  BAC  DIANA  representation  is  much 
smaller than the ABC's representation. Comparative statistics for a small test program* are
provided in Table 1 and Table 2, which were adapted from Murphy[20]. Problems arise when one
tries to compare the BAC and the ABC middle ends. Because we received an early version of their
middle end, which we intended to reimplement, the state of the middle end we have for
comparison is incomplete. In addition, efforts at AT&T have more recently gone into building a
production compiler from scratch, so figures reflecting a complete version of their middle end are
impossible to obtain.

Nevertheless,  in  an  effort  to  make  a  meaningful  analysis  of  the  quality  of  the  code  generated 
by  the  BAC  middle  end,  comparisons  will  be  made  with  pc,  the  Berkeley  Unix  Pascal  compiler. 
The  BAC  compiles  source  to  object  code  at  approximately  160  lines  per  minute,  which  is  2.3 
times  slower  than  pc.  While  part  of  this  poor  performance  can  be  attributed  to  the  complexity  of 
DIANA,  the  middle  end  accounts  for  almost  a  quarter  of  the  total  compile  time. 

Table  3  contains  a  comparison  of  execution  times  for  8  small  benchmark  programs.  Perm 
generated  all  permutations  of  7  objects.  Towers  solved  the  Towers  of  Hanoi.  Queens  solved  the  8 
queens problem. IntMM and MM did an integer and real matrix multiply, respectively. Puzzle
has been introduced. Quick, Bubble, and Tree were all sorting algorithms. FFT solved a fast

† This figure is somewhat suspect. See the discussion in the following section.

* The program was the ubiquitous Puzzle program of Forest Baskett, upon which seemingly thousands
of analyses have unfortunately been based.
Swap repeatedly. The overhead of passing the static link to the callee, and setting up the static
linkage was the cause of the poor performance. A suggested modification of the simple static
linkage model presented here would solve the problem, but also considerably complicate the
model.

We  view  the  favorable  comparison  with  a  language  as  relatively  simple  as  Pascal  as  evi¬ 
dence  that  our  runtime  implementation  was  successful.  In  conclusion,  we  feel  that  the  runtime 
system  suggested  provides  efficient  execution  time  performance  with  little  distributed  overhead 
and  offers  a  conceptually  simple  and  useful  runtime  model  for  Ada. 

3.3.5. Smalltalk Implementation Techniques for a RISC Architecture

There  are  several  reasons  why  Smalltalk  programs  have  proven  especially  difficult  to  execute 
quickly. 

• The language has been defined in terms of a bytecode interpreter. Interpreters are slow.

•  The  pure  object-orientation  of  the  language  implies  a  huge  number  of  procedure  calls  (“mes¬ 
sages”),  which  are  often  time-consuming  in  conventional  implementations. 

• The definition of Smalltalk execution and the style of its customary use require the rapid
creation and automatic reclamation of many objects. This puts a heavy demand on the memory
management  mechanism. 

We have implemented Smalltalk-80 on an instruction-level simulator for a RISC microcomputer
called SOAR. Measurements suggest that even a conventional computer can provide high
performance for Smalltalk-80 by

•  abandoning  the  ‘Smalltalk  Virtual  Machine’  in  favor  of  compiling  Smalltalk  directly  to  SOAR 
machine  code, 

• linearising the activation records on the machine stack,

•  eliminating  the  object  table,  and 

• replacing reference counting with a new technique called Generation Scavenging.

In order to implement these techniques, we had to find new ways of

• hashing objects,

• accessing often-used objects,

• invoking blocks,

•  referencing  activation  records, 

•  managing  activation  record  stacks,  and 

•  converting  the  virtual  machine  images. 

These techniques have been summarised by Samples et al.[21].

3.3.6. The PAN Language-Based Editor

Pan  is  a  multilingual  language-based  editor  for  manipulating  tree-structured  documents. 
The  editor  supports  both  tree-  and  text-oriented  operations.  The  expected  use  of  this  system  is  as 
the  front-end  for  a  development  environment  in  which  experienced  developers  use  several 
languages while creating a complex program or other document. One task of the front-end is to
gather and make available information about the document for use by the developers and by
other tools.

Multiple  languages  are  handled  by  separating  the  language-specific  information  from  the 
generic utilities supplied by the editor. Language-specific information, in the form of a language
description, is preprocessed into tables for use by the editor. The editing component itself is
table-driven. New languages can be added to the system by creating and loading a new set of
tables. Pan is designed to handle different languages in different editing workspaces; switching
workspaces within an editing session allows the user to edit different languages.
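
The separation between generic editor code and per-language tables can be sketched as follows.
This is a hypothetical illustration in Python, not Pan's design: the classes `LanguageTables` and
`Editor`, and the use of a keyword set as the only table, are assumptions chosen to keep the
example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageTables:
    """Stand-in for the tables generated from one language description."""
    name: str
    keywords: frozenset


class Editor:
    """Generic, table-driven editor core; it contains no language-specific code."""

    def __init__(self):
        self.tables = {}        # language name -> LanguageTables
        self.workspaces = {}    # workspace name -> language name
        self.current = None     # name of the active workspace

    def load_tables(self, tables):
        self.tables[tables.name] = tables          # adding a language = loading tables

    def open_workspace(self, workspace, language):
        if language not in self.tables:
            raise KeyError(f"no tables loaded for {language}")
        self.workspaces[workspace] = language
        self.current = workspace

    def switch(self, workspace):
        if workspace not in self.workspaces:
            raise KeyError(f"unknown workspace {workspace}")
        self.current = workspace

    def is_keyword(self, word):
        language = self.workspaces[self.current]
        return word in self.tables[language].keywords
```

In this model, the same `is_keyword` call gives different answers after `switch`, because the
active workspace selects which tables are consulted.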

There  are  two  major  components  to  the  Pan  system:  the  editor  and  the  table  generator. 
The  editor  supplies  editing  operations  while  checking  that  the  document  meets  the  requirements 
of the language in which it is written. These requirements fall into three categories: lexical,
syntactic, and contextual. (Contextual requirements are often called the “static semantics” of a
language.) Information concerning errors or inconsistencies in the document is communicated to
the  user  during  the  course  of  editing. 

The  editor  uses  both  the  concrete  representation  of  a  document  (the  representation  as  seen 
by a user of the system) and the abstract syntax of the document to implement its editing
operations. The correspondence between the two representations is maintained by an incremental
scanning and parsing system. The abstract syntax is in the form of an operator/phylum tree[22].
Contextual constraints are enforced using only the abstract syntax. Other tools in the environment
may  add  information  to  the  internal  tree  representation;  it  is  the  structure  of  the  tree  which  is  of 
primary  interest  to  the  editor. 

The  table  generator  takes  a  language  description,  checks  it  for  consistency  and  for  the  pro¬ 
perties  required  by  the  algorithms  used  in  Pan,  and  then  generates  the  tables  used  by  the  editor. 
In  fact,  the  table  generator  is  a  collection  of  tools,  many  of  which  already  exist  in  the  UNIX  pro¬ 
gramming  environment. 

Pan will be implemented on a SUN workstation.* The primary implementation language will
be  LISP,  with  recourse  to  C  for  low  level  routines  and  access  to  the  screen. 


3.3.7. The VorTeX System

The goal of the Berkeley VorTeX† project is to build a new document processing environment
with the following major features:


Interactive: The system allows the user to edit/preview TeX documents interactively. In
order to achieve a satisfactory degree of interaction the system must be incremental (as
opposed to batch) in both reformatting and redisplay; namely, it only reprocesses those
portions of the document that have been changed during an editing/previewing session.

Multiple Representations: Both source (ASCII representation) and proof (bitmap
representation) of the document are maintained by the system, with, of course, an intermediate
representation transparent to the user. The user is allowed to edit both representations,
whichever he/she considers more convenient. Most importantly, changes made to one
representation must be propagated to the other automatically.

Extensible: The system must be flexible enough to incorporate objects such as program
fragments, tables, graphics, etc. In particular, TeX’s powerful macro facility must be
preserved.

Easy  to  use  and  reasonably  portable. 


* SUN workstation is a trademark of Sun Microsystems, Inc.
† For Visually Oriented TeX.


— December 1985 —


Final  Report 


-  43  - 


Software  Development  Systems 


Based  on  the  design  goals  listed  above,  the  following  requirements  have  been  identified: 

(1)  VorTeX  and  TeX  must  be  able  to  generate  identical  outputs. 

(2) The user should feel more comfortable with VorTeX than with the current disintegrated
environment.

VorTeX’s fundamental departure from other systems is in the notion of multiple
representations. Being interactive and incremental is nothing new in the field of text processing. A
number of so-called WYSIWYG (What You See Is What You Get) programs have emerged in the past few
years that advocate friendly user interfaces. Unfortunately, such nice user interfaces often turn
out to discriminate against professional and experienced users: power and ease of use
sometimes contradict each other. Furthermore, of the many WYSIWYG systems available, none
seems to be able to produce documents of higher quality than that of TeX, a batch-oriented and
far less interactive system.

It  is  our  belief  that  having  a  good  user  interface  is  important,  but  not  at  the  cost  of  the 
quality  and  flexibility  of  TeX.  The  ideal  scenario  is  that  the  user  be  allowed  to  modify  the 
appearance of the document as in any WYSIWYG system and that he/she be able to enter
high-level formatting commands if that is considered more convenient. The implication is that both
source and proof of the document must be available in the environment, and the corresponding
representation  coherence  problem  becomes  central  to  this  research.  This  multiple  representation 
problem  is  not  unique  to  document  processing  per  se.  Almost  all  large-scale  software  systems 
such  as  VLSI,  CAD,  programming  environments,  and  databases  have  similar  problems. 

We propose an object-oriented approach to this whole environment. The smallest object of
granularity is, in TeX’s jargon, an “hbox”. An object contains pointers to its positions in both
representations, pointers to its neighboring objects, and (implicit) pointers to the class it
inherits from. It also maintains some format attributes such as font type, size, etc. and the
information for
handling  representation  coherence,  i.e.  a  declarative  set  of  legitimate  bitmap  operations  with 
respect  to  this  particular  object  type  and  the  corresponding  TeX  commands.  Changes  made  to  an 
object  are  broadcast  to  related  parties.  Higher  order  objects  such  as  paragraphs  and  pages  have 
access  to  the  various  formatting  and  displaying  modules  and  based  on  some  heuristics  decide 
when  to  incrementally  reprocess  the  document.  The  advantages  of  this  approach  are: 

(1)  Open  system:  new  objects  and  modules  can  be  plugged  in  easily. 

(2)  Incremental:  message  passing  triggers  the  necessary  updates  at  some  reasonable  granularity. 
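
The object structure described above can be sketched in a few lines. This is a hypothetical
Python model rather than the VorTeX implementation: the class names `Box` and `Paragraph`, the
field layout, and the rule that any change marks a paragraph stale are all assumptions made for
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Smallest object: records its position in both representations."""
    source_span: tuple                 # (start, end) character offsets in the TeX source
    proof_rect: tuple                  # (x, y, width, height) in the bitmap proof
    font_size: int = 10
    listeners: list = field(default_factory=list)

    def set_font_size(self, size):
        self.font_size = size
        for notify in self.listeners:  # broadcast the change to related parties
            notify(self, "font_size")


class Paragraph:
    """Higher order object: decides when to reprocess the boxes it contains."""

    def __init__(self, boxes):
        self.boxes = boxes
        self.dirty = False
        for box in boxes:
            box.listeners.append(self.on_change)

    def on_change(self, box, attribute):
        self.dirty = True              # assumed heuristic: any change marks it stale

    def reformat_if_needed(self):
        if not self.dirty:
            return False               # untouched paragraphs are not reprocessed
        self.dirty = False
        return True                    # a real system would reformat and redisplay here
```

Only the paragraph that owns a changed box reports work to do, which is the incremental
behaviour the message-passing design is meant to provide.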

Work  on  VorTeX  was  begun  under  this  contract  and  is  continuing  under  the  new  contract. 
In  addition  to  developing  the  overall  design  of  the  project,  we  have  obtained  the  results  described 
in  the  following  sections. 

Fonts  for  TeX 

In the process of providing fonts for the various output devices that are available to produce
TeX output, we have encountered many situations that require non-standard fonts. One good
example of this is providing fonts for the Sun workstation preview program dvitool. At the low
resolution  of  the  screen,  it  is  difficult  to  create  good  looking  fonts,  and  most  fonts  require  tuning 
to  be  even  readable. 

There  has  always  been  a  difficulty  in  tuning  existing  fonts  because  the  files  are  binary  and 
cannot  be  easily  modified  with  the  existing  editors.  Previously,  the  only  way  to  do  this  was  by 
converting the font into an ascii file with the character bitmaps represented as an array of
asterisks and periods, which could be edited. After the file was changed, this ascii representation
could be converted back to the binary font format. Needless to say, this was a cumbersome way to
make changes and only relatively small fonts could be dealt with. Also, creating new fonts was
not easily done because of the format of the ascii representation files.

To help in modifying the binary fonts, we have written pxtool, which is a Sun Windows
bitmap editor for the pixel fonts. It takes a font file and allows one to change the pixels in the
font on the screen. The format of the tool is something like the standard bitmap editor IconTool
provided by Sun. The character image is actually changed by clicking mouse keys at the appropriate
pixels  and  font  parameters  may  be  changed  by  typing  them  to  prompts  at  the  top  of  the  tool. 
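
For comparison, the older asterisk-and-period representation of a character bitmap mentioned
above can be sketched as a pair of conversions. The function names are hypothetical, and the
sketch handles a single glyph held as rows of booleans; it does not read or write the binary
pixel font format.

```python
def bitmap_to_ascii(rows):
    """Render a glyph bitmap (rows of booleans) as lines of '*' and '.'."""
    return "\n".join("".join("*" if bit else "." for bit in row) for row in rows)


def ascii_to_bitmap(text):
    """Parse lines of '*' and '.' back into rows of booleans."""
    return [[ch == "*" for ch in line] for line in text.splitlines()]
```

Editing the text form with an ordinary editor and converting back corresponds to the cumbersome
cycle that pxtool replaces with direct manipulation of pixels.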

DVItool: TeX Without Paper


DVItool is a previewer for TeX that operates on the SUN family of computers under the
SUN window system. It is an offspring in a long line of DVI-to-whatever programs which differs
in one fundamental aspect: it is run in its own window concurrently with an arbitrary number of
other processes. DVItool is meant to be run once during the entire computing session rather than
executed each time a DVI file is to be previewed. Concurrency has mandated that DVItool be
robust enough in its error handling and flexible enough in its command structure to be useful
throughout a potentially lengthy session.


A  number  of  features  have  been  added  to  DVItool  to  make  it  a  powerful  and  flexible  tool: 
the  ability  to  change  directories,  a  single  keystroke  command  to  re-read  a  potentially  updated 
DVI  file,  the  ability  to  do  wild-card  page  searches  on  any  of  TeX’s  ten  count  variables,  an  initiali¬ 
sation  file  to  allow  a  degree  of  user  customisation,  and  six  levels  of  magnification  corresponding  to 
TeX’s  six  magsteps. 
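
A wild-card page search over the ten count values can be sketched as follows. The pattern syntax
used here (fields separated by periods, with `*` matching any value and omitted trailing fields
treated as wild) is an assumption made for illustration and is not taken from DVItool; the
function names are likewise hypothetical.

```python
def page_matches(counts, pattern):
    """counts: the ten count values of one page; pattern: e.g. '3.*.1'."""
    fields = pattern.split(".")
    if len(fields) > len(counts):
        return False
    # Fields beyond the end of the pattern are left unconstrained.
    return all(f == "*" or int(f) == c for f, c in zip(fields, counts))


def find_pages(pages, pattern):
    """Return the indices of all pages whose count values match the pattern."""
    return [i for i, counts in enumerate(pages) if page_matches(counts, pattern)]
```

For example, with pages whose first two counts are (1, 0), (2, 0) and (1, 5), the pattern `1`
selects the first and third pages, and `*.5` selects only the third.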


In addition, another program, TeXdvi, has been written to simplify the edit-TeX-preview
cycle. The standard method of creating TeX documents is to edit the TeX document with one’s
favorite  text  editor,  run  TeX  on  the  edited  document,  and  then  send  the  resulting  DVI  file  to 
either  a  previewer  or  a  hard-copy  device.  TeXdvi  simplifies  this  cycle  by  consolidating  the  last 
two  steps:  it  runs  TeX*  on  the  TeX  source  and  then  sends  DVItool  a  message  to  preview  the 
resulting  DVI  file.  If  there  are  errors  in  the  TeX  job,  TeXdvi  queries  the  user  about  whether  to 
preview  the  DVI  file. 
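
The control flow of this consolidated step can be sketched as below. The function `texdvi` and
its three callable parameters are hypothetical stand-ins: how TeXdvi actually runs TeX, detects
errors, or sends its message to DVItool is not described here, so those parts are passed in as
assumptions.

```python
def texdvi(source, run_tex, notify_previewer, ask_user):
    """Run TeX on source, then ask the previewer to show the resulting DVI file.

    run_tex(source) is assumed to return (dvi_path, had_errors).
    notify_previewer(dvi_path) is assumed to message the running previewer.
    ask_user(question) is assumed to return True or False.
    """
    dvi_path, had_errors = run_tex(source)
    if had_errors and not ask_user(f"TeX reported errors in {source}; preview anyway?"):
        return False                   # user declined to preview a faulty job
    notify_previewer(dvi_path)
    return True
```

Passing the three steps in as functions keeps the sketch independent of any particular TeX
installation or window system.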

The  combination  of  DVItool  and  TeXdvi  alleviates  much  of  the  tedium  associated  with  the 
creation  of  documents  using  a  batch-oriented  text  processor  such  as  TeX. 


4.  Research  in  Expert  Database  Systems 


4.1.  Introduction 


Under  this  contract  we  have  developed  an  extended  language,  QUEL*,  containing  the  fol¬ 
lowing  capabilities: 


the  ability  to  support  procedural  data  and  the  capability  to  execute  it 
the ability to reference subobjects within procedurally defined objects
the  ability  to  perform  indefinite  iteration 

We  have  also  built  an  optimizer  for  this  language,  and  performed  initial  exploration  on  how  to 
generalise the language to provide a rules system. These three topics are discussed in the
following sections.


4.2.  The  Definition  of  QUEL* 

Basically, any column of a relation in QUEL* can hold a data element that is a procedure.
For example, the following tuple contains data on an employee indicating his name, salary,
manager, age, and the department that he works in. The last data object is defined procedurally.

name    salary    manager    age    dept
Mike    2000      Fred       25     retrieve (DEPT.all) where DEPT.name = “shoe”

An EMP Tuple
Table 1

* or any other TeX processor such as LaTeX or SliTeX.

Procedural  data  can  be  executed  as  follows: 

    execute (EMP.dept) where EMP.name = “Mike”

Moreover, if a DEPT object has fields dname and floor, then one can find the floor number that
Mike works on by running the following command.

    retrieve (EMP.DEPT.floor) where EMP.name = “Mike”

Here, we use a nested dot notation, e.g. EMP.DEPT.floor, to reference a subobject within the
procedurally defined object DEPT. This addressing was pioneered by the data sublanguage
GEM[23]. Lastly, indefinite iteration of commands is supported by including a * operator.
Hence,  we  can  update  the  salaries  of  all  employees  to  be  no  larger  than  that  of  their  direct 
manager  or  anyone  else  that  they  report  to  indirectly  as  follows: 

    replace* EMP (salary = M.salary) from M in EMP
        where EMP.manager = M.name
        and EMP.salary > M.salary

The * indicates that logically the command should continue to execute until it no longer has any
effect. 
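
The meaning of the iteration can be modelled outside the database as a loop that runs to a fixed
point. The sketch below is an assumption-laden Python illustration, not the INGRES
implementation: the relation is a dictionary, the function name `replace_star` and the employee
Sue are invented, and rows are updated one at a time rather than as a set, which reaches the same
final state for this command.

```python
def replace_star(emp):
    """emp maps name -> {"salary": int, "manager": name or None}.

    Repeatedly lower each salary to the direct manager's salary whenever it
    exceeds it, and stop once a full pass changes nothing.
    """
    changed = True
    while changed:
        changed = False
        for row in emp.values():
            manager = emp.get(row["manager"])
            if manager is not None and row["salary"] > manager["salary"]:
                row["salary"] = manager["salary"]
                changed = True
    return emp
```

If Mike (2000) reports to Fred (1500), who reports to a hypothetical Sue (1000), the loop ends
with all three salaries at 1000, so Mike is also bounded by the manager he reports to indirectly.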

4.3.  Optimisation  of  QUEL* 

We  have  spent  considerable  time  designing  an  optimiser  for  QUEL*  and  in  extending  the 
prototype  INGRES  relational  database  system[24]  to  optimise  and  execute  QUEL*  commands. 
We illustrate the design of the optimiser with an extended example in this section. The full
report on the design and implementation of QUEL* has been accepted for publication in the ACM
Transactions on Database Systems[25].

Although more sophisticated query processing algorithms have been constructed[26] [27], our
implementation  builds  on  the  original  INGRES  strategy[28].  The  implementation  of  QUEL*  has 
been  accomplished  using  this  code  because  it  is  readily  available  for  experimentation.  Integration 
of  our  constructs  into  more  advanced  optimizers  appears  straightforward,  and  we  discuss  this 
point  in  more  detail  at  the  end  of  this  section. 

Detachment of one-variable queries that do not contain multiple dot or relation level operators
can proceed as in the original INGRES algorithms[28]. Similarly, the reduction module of
decomposition is unaffected by our extensions to QUEL. In addition, tuple substitution is
performed when all other processing steps fail. A glance at the left hand column of Figure 1
indicates that a test for zero variables must be inserted into the original flow of control after the
reduction module. Then, new facilities must be included to process the "yes" branch of the test.
These include a test for whether there is a relation to materialise and the code to perform this
step. Lastly, the one-variable query processor must be extended to process relation level operators.
We explain these extensions with a detailed example.

The  desired  task  is  to  find  the  polygon  descriptions  with  identifiers  less  than  5  for  all  objects 
which  have  the  same  collection  of  shapes  as  the  complex  object  with  Oid  equal  to  10,  i.e: 

range of o is OBJECT
range of o1 is OBJECT
retrieve (o.shape.p-desc)
where o.shape.Pid < 5
and o.shape == o1.shape
and o1.Oid = 10

The initial step of the reduction process finds that the final clause in the query has only a single
variable, so it can be executed as:

retrieve into TEMP-1 (o1.shape) where o1.Oid = 10

The original query is now:


— December 1985 —




Final  Report 


-  40  - 


Expert  Database  Systems 


Figure 1 shows a diagram of the extended decomposition process.

Figure 1 - Extended Decomposition

retrieve (o.shape.p-desc)
where o.shape.Pid < 5
and o.shape == TEMP-1.shape

The first clause above contains a multiple dot attribute and should not be processed until later.
At this point reduction fails and the query still has two variables in it, so processing falls through
to the tuple substitution module. If TEMP-1 is selected for substitution, the resulting query is:

retrieve (o.shape.p-desc)
where o.shape.Pid < 5
and o.shape == "QUEL-constant-1"

Notice that the variable "TEMP-1.shape" has been replaced by a constant (QUEL-constant-1)
which is a collection of QUEL commands. Processing now returns to the top of the loop where
the query still does not have any one-variable clauses. Processing again returns to tuple
substitution where the variable o is chosen. This results in the query:

retrieve ("QUEL-constant-2".p-desc)
where "QUEL-constant-2".Pid < 5
and "QUEL-constant-3" == "QUEL-constant-1"

Notice that o.shape has been replaced by two constants (QUEL-constant-2 and QUEL-constant-3)
which are identical. When o.shape is materialised, there will be a one-relation clause (o.shape.Pid
< 5) that can be used to restrict and project the relation. Moreover, it is desirable to check this
clause as early as possible because the current query will have no answer if this clause is false.
On the other hand, o.shape must be retained as a complete object so that the relation level
comparison with QUEL-constant-1 can be performed if necessary. In order to avoid forcing the
relation level operator to be executed first, we have duplicated the QUEL constant and thereby
retained the option of performing the one-variable restriction first. Even though QUEL-constant-2
and QUEL-constant-3 define the same object, the caching module contained in our prototype
should avoid materialising this object more than once.

Now the command has zero variables and is passed to the materialise module. This processing
step chooses one of the QUEL constants and materialises the outer-union of the RETRIEVE
and DEFINE VIEW commands into a relation TEMP-2. If "QUEL-constant-2" is chosen, then
the resulting query will be:

retrieve (TEMP-2.p-desc) where
"QUEL-constant-3" == "QUEL-constant-1"
and TEMP-2.Pid < 5

This query now has a one-variable clause which can be detached and processed, creating another
temporary relation TEMP-3. If TEMP-3 is empty then the query is false and can be terminated.
Otherwise, processing must continue on the following command.

retrieve (TEMP-3.p-desc) where
"QUEL-constant-3" == "QUEL-constant-1"

The qualification is again free from variables, so another relation must be materialised. If
"QUEL-constant-1" is chosen, we obtain:

retrieve (TEMP-3.p-desc) where
"QUEL-constant-3" == TEMP-4

The qualification is still free from variables, so the final relation must be materialised as follows:

retrieve (TEMP-3.p-desc) where
TEMP-5 == TEMP-4

After another trip around the processing loop, no further materialisation is possible. Hence, the
query must now be passed to the one-variable query processor. This module will process the
operator == for the two relations involved.

Several comments are appropriate at this time. This algorithm delays materialising a relation
until there is no conventional processing to do. In addition, it delays evaluating relation
level operators until there is nothing else to do. This reflects our belief that expensive operations
should never be done until absolutely necessary.
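
The ordering of the processing steps described in this section can be summarised as a small decision function. The sketch below is a Python illustration only, not the INGRES code: the function name next_step, its four parameters, and the returned labels are hypothetical, and each argument stands for a property of the current query that the real modules would compute themselves.

```python
def next_step(variables, detachable_clause, reducible, unmaterialised_constants):
    """Pick the next decomposition step for the current query.

    variables: number of tuple variables left in the qualification
    detachable_clause: True if a one-variable clause without multiple dot
        or relation level operators can be detached
    reducible: True if the reduction module can still make progress
    unmaterialised_constants: number of QUEL constants not yet materialised
    """
    if detachable_clause:
        return "detach"
    if reducible:
        return "reduce"
    if variables == 0:
        # The "yes" branch of the zero-variable test.
        if unmaterialised_constants > 0:
            return "materialise"
        # Relation level operators are evaluated last.
        return "one-variable processor"
    # Performed only when all other processing steps fail.
    return "tuple substitution"


# Replay of the worked example, after TEMP-1 has been created.
trace = [
    next_step(2, False, False, 0),  # o and TEMP-1 remain
    next_step(1, False, False, 1),  # o remains
    next_step(0, False, False, 3),  # three QUEL constants
    next_step(0, True, False, 2),   # TEMP-2.Pid < 5 can be detached
    next_step(0, False, False, 2),
    next_step(0, False, False, 1),
    next_step(0, False, False, 0),  # TEMP-5 == TEMP-4
]
```

Materialisation and relation level operators sit at the bottom of the decision order, which matches the stated preference for postponing expensive operations.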

Second, most current optimisers build a complete query plan in advance of executing the
command. Such optimisers (e.g. [25]) can construct a plan for the portion of the query without
nested dot constructs. However, run-time planning may be required on remaining portions or
commands. For example, the following query must be processed by tuple substitution for o or o1.

retrieve (o.shape.p-desc, o1.shape.p-desc)
where o.shape.l-desc = o1.shape.l-desc

After substitution twice, the remaining query is

retrieve (TEMP-1.p-desc, TEMP-2.p-desc)
where TEMP-1.l-desc = TEMP-2.l-desc

The characteristics of TEMP-1 and TEMP-2 are not known until run time, so further query
planning must be deferred to this time.

Lastly, in our prototype the module that materialises a relation passes the RETRIEVE
commands to another process which also runs the INGRES code. This second INGRES executes the
command, stores the resulting relation in the database, and then passes control back to the first
INGRES. A second process is required because the INGRES code will not allow a command to
suspend in the middle of the decomposition process so that a new command can be executed. The
ability to "stack" the execution state of a query would be a very desirable addition to the system.



Appendix A - Summary of facilities

Facilities marked with a † are not supported in 4.2BSD.

1. Kernel primitives

1.1. Process naming and protection

sethostid           set UNIX host id
gethostid           get UNIX host id
sethostname         set UNIX host name
gethostname         get UNIX host name
getpid              get process id
fork                create new process
exit                terminate a process
execve              execute a different process
getuid              get user id
geteuid             get effective user id
setreuid            set real and effective user id's
getgid              get accounting group id
getegid             get effective accounting group id
getgroups           get access group set
setregid            set real and effective group id's
setgroups           set access group set
getpgrp             get process group
setpgrp             set process group

1.2 Memory management

<mman.h>            memory management definitions
sbrk                change data section size
sstk†               change stack section size
getpagesize         get memory page size
mmap†               map pages of memory
mremap†             remap pages in memory
munmap†             unmap memory
mprotect†           change protection of pages
madvise†            give memory management advice
mincore†            determine core residency of pages

1.3 Signals

<signal.h>          signal definitions
sigvec              set handler for signal
kill                send signal to process
killpgrp            send signal to process group
sigblock            block set of signals
sigsetmask          restore set of blocked signals
sigpause            wait for signals
sigstack            set software stack for signals

1.4 Timing and statistics

<sys/time.h>        time-related definitions
gettimeofday        get current time and timezone
settimeofday        set current time and timezone
getitimer           read an interval timer
setitimer           get and set an interval timer
profil              profile process

1.5 Descriptors

getdtablesize       descriptor reference table size
dup                 duplicate descriptor
dup2                duplicate to specified index
close               close descriptor
select              multiplex input/output
fcntl               control descriptor options
wrap†               wrap descriptor with protocol

1.6 Resource controls

<sys/resource.h>    resource-related definitions
getpriority         get process priority
setpriority         set process priority
getrusage           get resource usage
getrlimit           get resource limitations
setrlimit           set resource limitations

1.7 System operation support

mount               mount a device file system
swapon              add a swap device
umount              umount a file system
sync                flush system caches
reboot              reboot a machine
acct                specify accounting file

2. System facilities

2.1 Generic operations

read                read data
write               write data
<sys/uio.h>         scatter-gather related definitions
readv               scattered data input
writev              gathered data output
<sys/ioctl.h>       standard control operations
ioctl               device control operation

2.2 File system

Operations marked with a * exist in two forms: as shown, operating on a file name, and
operating on a file descriptor, when the name is preceded with a "f".

<sys/file.h>        file system definitions
chdir               change directory
chroot              change root directory
mkdir               make a directory
rmdir               remove a directory
open                open a new or existing file
mknod               make a special file
portal†             make a portal entry
unlink              remove a link
stat*               return status for a file
lstat               returned status of link
chown*              change owner
chmod*              change mode
utimes              change access/modify times
link                make a hard link
symlink             make a symbolic link
readlink            read contents of symbolic link
rename              change name of file
lseek               reposition within file
truncate*           truncate file
access              determine accessibility
flock               lock a file

2.3 Communications

<sys/socket.h>      standard definitions
socket              create socket
bind                bind socket to name
getsockname         get socket name
listen              allow queueing of connections
accept              accept a connection
connect             connect to peer socket
socketpair          create pair of connected sockets
sendto              send data to named socket
send                send data to connected socket
recvfrom            receive data on unconnected socket
recv                receive data on connected socket
sendmsg             send gathered data and/or rights
recvmsg             receive scattered data and/or rights
shutdown            partially close full-duplex connection
getsockopt          get socket option
setsockopt          set socket option

2.4 Terminals, block and character devices

2.5 Processes and kernel hooks

Appendix B - File System Implementation

1.  Introduction 

This  appendix  describes  the  changes  from  the  original  512  byte  UNIX  file  system  to  the  new 
one  released  with  the  4.2  Berkeley  Software  Distribution.  It  presents  the  motivations  for  the 
changes, the methods used to effect these changes, the rationale behind the design decisions, and a
description  of  the  new  implementation.  This  discussion  is  followed  by  a  summary  of  the  results 
that  have  been  obtained,  directions  for  future  work,  and  the  additions  and  changes  that  have 
been  made  to  the  user  visible  facilities.  The  paper  concludes  with  a  history  of  the  software 
engineering  of  the  project. 

The original UNIX system that runs on the PDP-11† has simple and elegant file system
facilities. File system input/output is buffered by the kernel; there are no alignment constraints
on data transfers and all operations are made to appear synchronous. All transfers to the disk are
in 512 byte blocks, which can be placed arbitrarily within the data area of the file system. No
constraints other than available disk space are placed on file growth [Ritchie74], [Thompson78].

When used on the VAX-11 together with other UNIX enhancements, the original 512 byte
UNIX file system is incapable of providing the data throughput rates that many applications
require. For example, applications that need to do a small amount of processing on large
quantities of data, such as VLSI design and image processing, need to have a high throughput from the
file system. High throughput rates are also needed by programs with large address spaces that are
constructed by mapping files from the file system into virtual memory. Paging data in and out of
the file system is likely to occur frequently. This requires a file system providing higher
bandwidth than the original 512 byte UNIX one which provides only about two percent of the
maximum disk bandwidth or about 20 kilobytes per second per arm [White80], [Smith81b].

Modifications  have  been  made  to  the  UNIX  file  system  to  improve  its  performance.  Since 
the  UNIX  file  system  interface  is  well  understood  and  not  inherently  slow,  this  development 
retained  the  abstraction  and  simply  changed  the  underlying  implementation  to  increase  its 
throughput.  Consequently  users  of  the  system  have  not  been  faced  with  massive  software  conver¬ 
sion. 

Problems with file system performance have been dealt with extensively in the literature;
see [Smith81a] for a survey. The UNIX operating system drew many of its ideas from Multics, a
large, high performance operating system [Feiertag71]. Other work includes Hydra [Almes78],
Spice [Thompson80], and a file system for a Lisp environment [Symbolics81a].

A major goal of this project has been to build a file system that is extensible into a
networked environment [Holler73]. Other work on network file systems describes centralised file
servers [Accetta80], distributed file servers [Dion80], [Luniewski77], [Porcar82], and protocols to
reduce the amount of information that must be transferred across a network [Symbolics81b],
[Sturgis80].


2.  Old  File  System 

In the old file system developed at Bell Laboratories each disk drive contains one or more
file systems.‡ A file system is described by its super-block, which contains the basic parameters of
the file system. These include the number of data blocks in the file system, a count of the
maximum number of files, and a pointer to a list of free blocks. All the free blocks in the system are
chained together in a linked list. Within the file system are files. Certain files are distinguished
as directories and contain pointers to files that may themselves be directories. Every file has a
descriptor associated with it called an inode. The inode contains information describing
ownership of the file, time stamps marking last modification and access times for the file, and an
array of indices that point to the data blocks for the file. For the purposes of this section we
assume that the first 8 blocks of the file are directly referenced by values stored in the inode
structure itself*. The inode structure may also contain references to indirect blocks containing
further data block indices. In a file system with a 512 byte block size, a singly indirect block
contains 128 further block addresses, a doubly indirect block contains 128 addresses of further single
indirect blocks, and a triply indirect block contains 128 addresses of further doubly indirect
blocks.

† DEC, PDP, VAX, MASSBUS, and UNIBUS are trademarks of Digital Equipment Corporation.
‡ A file system always resides on a single drive.
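
The indirect-block arithmetic above fixes the largest file that the inode structure can address. The following Python sketch works that number out under the assumptions stated in the text (8 direct blocks, 128 addresses per indirect block, 512 byte blocks); the function name max_file_blocks and its parameter names are illustrative only.

```python
def max_file_blocks(direct=8, addresses_per_block=128):
    """Count the data blocks reachable through one inode.

    Direct blocks, plus one singly, one doubly, and one triply
    indirect block, as described for the 512 byte file system.
    """
    single = addresses_per_block
    double = addresses_per_block ** 2
    triple = addresses_per_block ** 3
    return direct + single + double + triple


BLOCK_SIZE = 512
blocks = max_file_blocks()
max_bytes = blocks * BLOCK_SIZE
```

Almost all of the addressable space comes from the triply indirect block, so reaching the data of a very large file can take several extra disk accesses.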


A traditional 150 megabyte UNIX file system consists of 4 megabytes of inodes followed by
146 megabytes of data. This organisation segregates the inode information from the data; thus
accessing a file normally incurs a long seek from its inode to its data. Files in a single directory
are not typically allocated slots in consecutive locations in the 4 megabytes of inodes, causing
many non-consecutive blocks to be accessed when executing operations on all the files in a
directory.

The allocation of data blocks to files is also suboptimum. The traditional file system never
transfers more than 512 bytes per disk transaction and often finds that the next sequential data
block is not on the same cylinder, forcing seeks between 512 byte transfers. The combination of
the small block size, limited read-ahead in the system, and many seeks severely limits file system
throughput.

The first work at Berkeley on the UNIX file system attempted to improve both reliability
and throughput. The reliability was improved by changing the file system so that all
modifications of critical information were staged so that they could either be completed or
repaired cleanly by a program after a crash [Kowalski78]. The file system performance was
improved by a factor of more than two by changing the basic block size from 512 to 1024 bytes.
The increase was because of two factors: each disk transfer accessed twice as much data, and
most files could be described without need to access through any indirect blocks since the direct
blocks contained twice as much data. The file system with these changes will henceforth be
referred to as the old file system.

This performance improvement gave a strong indication that increasing the block size was a
good method for improving throughput. Although the throughput had doubled, the old file
system was still using only about four percent of the disk bandwidth. The main problem was that
although the free list was initially ordered for optimal access, it quickly became scrambled as files
were created and removed. Eventually the free list became entirely random, causing files to have
their blocks allocated randomly over the disk. This forced the disk to seek before every block
access. Although old file systems provided transfer rates of up to 175 kilobytes per second when
they were first created, this rate deteriorated to 30 kilobytes per second after a few weeks of
moderate use because of randomisation of their free block list. There was no way of restoring the
performance of an old file system except to dump, rebuild, and restore the file system. Another
possibility would be to have a process that periodically reorganised the data on the disk to restore
locality as suggested by [Maruyama76].


3. New file system organisation

As in the old file system organisation each disk drive contains one or more file systems. A
file system is described by its super-block, which is located at the beginning of its disk partition.
Because the super-block contains critical data it is replicated to protect against catastrophic loss.
This is done at the time that the file system is created; since the super-block data does not
change, the copies need not be referenced unless a head crash or other hard disk error causes the
default super-block to be unusable.


* The actual number may vary from system to system, but is usually in the range 5-13.

To ensure that it is possible to create files as large as 2^32 bytes with only two levels of
indirection, the minimum size of a file system block is 4096 bytes. The size of file system blocks
can be any power of two greater than or equal to 4096. The block size of the file system is
maintained in the super-block so it is possible for file systems with different block sizes to be accessible
simultaneously on the same system. The block size must be decided at the time that the file
system is created; it cannot be subsequently changed without rebuilding the file system.
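
The 4096 byte minimum can be checked with a short calculation. The sketch below is in Python and rests on one assumption not stated in this paragraph: block addresses occupy 4 bytes, which is what the earlier figure of 128 addresses per 512 byte block implies. Direct blocks are ignored because they add only a few blocks. The function names are illustrative.

```python
def two_level_capacity(block_size, address_bytes=4):
    """Bytes addressable through a singly and a doubly indirect block."""
    addresses = block_size // address_bytes
    return (addresses + addresses ** 2) * block_size


def minimum_block_size(target_bytes=2 ** 32, start=512):
    """Smallest power-of-two block size whose two-level capacity reaches the target."""
    size = start
    while two_level_capacity(size) < target_bytes:
        size *= 2
    return size


min_size = minimum_block_size()
```

With 2048 byte blocks the two levels reach only about 2^29 bytes, so 4096 is the first size that covers 2^32.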

The new file system organisation partitions the disk into one or more areas called cylinder
groups. A cylinder group is comprised of one or more consecutive cylinders on a disk.
Associated with each cylinder group is some bookkeeping information that includes a redundant copy of
the super-block, space for inodes, a bit map describing available blocks in the cylinder group, and
summary information describing the usage of data blocks within the cylinder group. For each
cylinder group a static number of inodes is allocated at file system creation time. The current
policy is to allocate one inode for each 2048 bytes of disk space, expecting this to be far more
than will ever be needed.

All the cylinder group bookkeeping information could be placed at the beginning of each
cylinder group. However, if this approach were used, all the redundant information would be on
the top platter. Thus a single hardware failure that destroyed the top platter could cause the loss
of all copies of the redundant super-blocks. Thus the cylinder group bookkeeping information
begins at a floating offset from the beginning of the cylinder group. The offset for each successive
cylinder group is calculated to be about one track further from the beginning of the cylinder
group. In this way the redundant information spirals down into the pack so that any single track,
cylinder, or platter can be lost without losing all copies of the super-blocks. Except for the first
cylinder group, the space between the beginning of the cylinder group and the beginning of the
cylinder group information is used for data blocks.†

3.1. Optimising storage utilisation

Data is laid out so that larger blocks can be transferred in a single disk transfer, greatly
increasing file system throughput. As an example, consider a file in the new file system composed
of 4096 byte data blocks. In the old file system this file would be composed of 1024 byte blocks.
By increasing the block size, disk accesses in the new file system may transfer up to four times as
much information per disk transaction. In large files, several 4096 byte blocks may be allocated
from the same cylinder so that even larger data transfers are possible before initiating a seek.

The main problem with bigger blocks is that most UNIX file systems are composed of many
small files. A uniformly large block size wastes space. Table 1 shows the effect of file system
block size on the amount of wasted space in the file system. The machine measured to obtain
these figures is one of our time sharing systems that has roughly 1.2 Gigabyte of on-line storage.
The measurements are based on the active user file systems containing about 920 megabytes of
formatted space.


Space used    % waste    Organisation
 775.2 Mb       0.0      Data only, no separation between files
 807.8 Mb       4.2      Data only, each file starts on 512 byte boundary
 828.7 Mb       6.9      512 byte block UNIX file system
 866.5 Mb      11.8      1024 byte block UNIX file system
 948.5 Mb      22.4      2048 byte block UNIX file system
1128.3 Mb      45.6      4096 byte block UNIX file system

Table 1 - Amount of wasted space as a function of block size.


† While it appears that the first cylinder group could be laid out with its super-block at the "known" location,
this would not work for file systems with block sizes of 16K or greater, because of the requirement that the
cylinder group information must begin at a block boundary.

The space wasted is measured as the percentage of space on the disk not containing user data. As
the block size on the disk increases, the waste rises quickly, to an intolerable 45.6% waste with
4096 byte file system blocks.
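
The waste column of Table 1 can be recomputed from the space-used column. The Python sketch below assumes, as the figures themselves suggest, that waste is expressed relative to the 775.2 Mb of user data in the first row; the variable names are illustrative. The recomputed values agree with the table to within 0.1 percentage point.

```python
# (space used in Mb, % waste as reported in Table 1)
TABLE_1 = [
    (775.2, 0.0),
    (807.8, 4.2),
    (828.7, 6.9),
    (866.5, 11.8),
    (948.5, 22.4),
    (1128.3, 45.6),
]

USER_DATA_MB = TABLE_1[0][0]  # data only, no separation between files


def waste_percent(space_used_mb, user_data_mb=USER_DATA_MB):
    """Overhead beyond the raw user data, as a percentage of that data."""
    return 100.0 * (space_used_mb - user_data_mb) / user_data_mb


recomputed = [waste_percent(used) for used, _ in TABLE_1]
largest_gap = max(
    abs(calc - reported) for calc, (_, reported) in zip(recomputed, TABLE_1)
)
```

Each doubling of the block size roughly doubles the overhead, which is the trend the paragraph above describes.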


To be able to use large blocks without undue waste, small files must be stored in a more
efficient way. The new file system accomplishes this goal by allowing the division of a single file
system block into one or more fragments. The file system fragment size is specified at the time
that the file system is created; each file system block can be optionally broken into 2, 4, or 8
fragments, each of which is addressable. The lower bound on the size of these fragments is
constrained by the disk sector size, typically 512 bytes. The block map associated with each cylinder
group records the space availability at the fragment level; to determine block availability, aligned
fragments are examined. Figure 1 shows a piece of a map from a 4096/1024 file system.


Bits in map         XXXX    XXOO    OOXX    OOOO
Fragment numbers    0-3     4-7     8-11    12-15
Block numbers       0       1       2       3

Figure 1 - Example layout of blocks and fragments in a 4096/1024 file system.

Each bit in the map records the status of a fragment; an "X" shows that the fragment is in use,
while an "O" shows that the fragment is available for allocation. In this example, fragments 0-5,
10, and 11 are in use, while fragments 6-9, and 12-15 are free. Fragments of adjoining blocks
cannot be used as a block, even if they are large enough. In this example, fragments 6-9 cannot
be coalesced into a block; only fragments 12-15 are available for allocation as a block.
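
The aligned-fragment rule can be expressed in a few lines. The sketch below is a Python illustration that represents the map from Figure 1 as a string of "X" and "O" characters; this representation and the function names are assumptions made for the example and say nothing about how the kernel stores the map.

```python
FRAGS_PER_BLOCK = 4  # a 4096/1024 file system

# The map from Figure 1: blocks 0 through 3, four fragments each.
BLOCK_MAP = "XXXX" + "XXOO" + "OOXX" + "OOOO"


def free_fragments(block_map):
    """Fragment numbers that are available for allocation."""
    return [i for i, bit in enumerate(block_map) if bit == "O"]


def free_blocks(block_map, frags_per_block=FRAGS_PER_BLOCK):
    """Block numbers whose aligned fragments are all free."""
    count = len(block_map) // frags_per_block
    return [
        b
        for b in range(count)
        if all(
            bit == "O"
            for bit in block_map[b * frags_per_block:(b + 1) * frags_per_block]
        )
    ]


frags = free_fragments(BLOCK_MAP)
blocks_available = free_blocks(BLOCK_MAP)
```

Fragments 6-9 are contiguous and free, but they straddle blocks 1 and 2, so only block 3 is reported as a free block.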

On a file system with a block size of 4096 bytes and a fragment size of 1024 bytes, a file is
represented by zero or more 4096 byte blocks of data, and possibly a single fragmented block. If
a file system block must be fragmented to obtain space for a small amount of data, the remainder
of the block is made available for allocation to other files. As an example consider an 11000 byte
file stored on a 4096/1024 byte file system. This file would use two full size blocks and a 3072
byte fragment. If no 3072 byte fragments are available at the time the file is created, a full size
block is split yielding the necessary 3072 byte fragment and an unused 1024 byte fragment. This
remaining fragment can be allocated to another file as needed.
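
The 11000 byte example follows from simple integer arithmetic. The Python sketch below computes how many full blocks and how many trailing fragments a file of a given size occupies; the function name file_layout is illustrative, and the rule that a tail needing all four fragments becomes a full block is taken from the allocation description in this section.

```python
def file_layout(size, block_size=4096, frag_size=1024):
    """Return (full_blocks, tail_fragments) for a file of the given size."""
    full_blocks, tail_bytes = divmod(size, block_size)
    # Round the tail up to whole fragments.
    tail_fragments = -(-tail_bytes // frag_size)
    if tail_fragments == block_size // frag_size:
        # A tail that needs every fragment is stored as a full block.
        full_blocks += 1
        tail_fragments = 0
    return full_blocks, tail_fragments


full, frags_needed = file_layout(11000)
tail_bytes_allocated = frags_needed * 1024
```

For the 11000 byte file the tail holds 2808 bytes of data in a 3072 byte fragment, leaving one 1024 byte fragment of a split block free for another file.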

The granularity of allocation is the write system call. Each time data is written to a file,
the system checks to see if the size of the file has increased*. If the file needs to be expanded to
hold the new data, one of three conditions exists:

1)  There is enough space left in an already allocated block to hold the new data. The new
    data is written into the available space in the block.

2)  Nothing has been allocated. If the new data contains more than 4096 bytes, a 4096 byte
    block is allocated and the first 4096 bytes of new data are written there. This process is
    repeated until less than 4096 bytes of new data remain. If the remaining new data to be
    written will fit in three or fewer 1024 byte pieces, an unallocated fragment is located,
    otherwise a 4096 byte block is located. The new data is written into the located piece.

3)  A fragment has been allocated. If the number of bytes in the new data plus the number of
    bytes already in the fragment exceeds 4096 bytes, a 4096 byte block is allocated. The
    contents of the fragment are copied to the beginning of the block and the remainder of the block
    is filled with the new data. The process then continues as in (2) above. If the number of
    bytes in the new data plus the number of bytes already in the fragment will fit in three or
    fewer 1024 byte pieces, an unallocated fragment is located, otherwise a 4096 byte block is
    located. The contents of the previous fragment appended with the new data are written into
    the allocated piece.


* A program may be overwriting data in the middle of an existing file, in which case space will already be
allocated.

The problem with allowing only a single fragment on a 4096/1024 byte file system is that
data may be potentially copied up to three times as its requirements grow from a 1024 byte
fragment to a 2048 byte fragment, then a 3072 byte fragment, and finally a 4096 byte block. The
fragment reallocation can be avoided if the user program writes a full block at a time, except for
a partial block at the end of the file. Because file systems with different block sizes may coexist
on the same system, the file system interface has been extended to provide the ability to determine
the optimal size for a read or write. For files the optimal size is the block size of the file system
on which the file is being accessed. For other objects, such as pipes and sockets, the optimal size
is the underlying buffer size. This feature is used by the Standard Input/Output Library, a package
used by most user programs. This feature is also used by certain system utilities such as
archivers and loaders that do their own input and output management and need the highest
possible file system bandwidth.
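
A program that manages its own input and output might query the optimal transfer size as sketched below. The sketch assumes a present-day POSIX system, where the st_blksize field returned by stat reports a preferred I/O size; whether that field corresponds exactly to the interface described here is an assumption. The helper names and the 8192 byte fallback are hypothetical.

```python
import os


def preferred_io_size(path, default=8192):
    """Return the preferred I/O size for path, or default if it is not reported."""
    st = os.stat(path)
    return getattr(st, "st_blksize", 0) or default


def copy_file(src, dst):
    """Copy src to dst, reading in units of the preferred I/O size.

    Returns the transfer size that was used.
    """
    size = preferred_io_size(src)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(size)
            if not chunk:
                break
            fout.write(chunk)
    return size
```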

The space overhead in the 4096/1024 byte new file system organisation is empirically
observed to be about the same as in the 1024 byte old file system organisation. A file system
with 4096 byte blocks and 512 byte fragments has about the same amount of space overhead as
the 512 byte block UNIX file system. The new file system is more space efficient than the 512
byte or 1024 byte file systems in that it uses the same amount of space for small files while
requiring less indexing information for large files. This saving is offset by the need to use more space
for keeping track of available free blocks. The net result is about the same disk utilisation when
the new file system's fragment size equals the old file system's block size.

In order for the layout policies to be effective, the disk cannot be kept completely full. Each
file system maintains a parameter that gives the minimum acceptable percentage of file system
blocks that can be free. If the number of free blocks drops below this level only the system
administrator can continue to allocate blocks. The value of this parameter can be changed at any
time, even when the file system is mounted and active. The transfer rates to be given in section 4
were measured on file systems kept less than 90% full. If the reserve of free blocks is set to zero,
the file system throughput rate tends to be cut in half, because of the inability of the file system
to localise the blocks in a file. If the performance is impaired because of overfilling, it may be
restored by removing enough files to obtain 10% free space. Access speed for files created during
periods of little free space can be restored by recreating them once enough space is available. The
amount of free space maintained must be added to the percentage of waste when comparing the
organisations given in Table 1. Thus, a site running the old 1024 byte UNIX file system wastes
11.8% of the space and one could expect to fit the same amount of data into a 4096/512 byte new
file system with 5% free space, since a 512 byte old file system wasted 6.9% of the space.

3.2.  File system parameterisation

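Except for the initial creation of the free list, the old file system ignores the parameters of
the underlying hardware. It has no information about either the physical characteristics of the
mass storage device, or the hardware that interacts with it. A goal of the new file system is to
parameterise the processor capabilities and mass storage characteristics so that blocks can be
allocated in an optimum configuration dependent way. Parameters used include the speed of the
processor, the hardware support for mass storage transfers, and the characteristics of the mass
storage devices. Disk technology is constantly improving and a given installation can have several
different disk technologies running on a single processor. Each file system is parameterised so
that it can adapt to the characteristics of the disk on which it is placed.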

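For mass storage devices such as disks, the new file system tries to allocate new blocks on
the same cylinder as the previous block in the same file. Optimally, these new blocks will also be
well positioned rotationally. The distance between "rotationally optimal" blocks varies greatly;
it can be a consecutive block or a rotationally delayed block depending on system characteristics.
On a processor with a channel that does not require any processor intervention between mass
storage transfer requests, two consecutive disk blocks can often be accessed without suffering lost
time because of an intervening disk revolution. For processors without such channels, the main
processor must field an interrupt and prepare for a new disk transfer. The expected time to
service this interrupt and schedule a new disk transfer depends on the speed of the main
processor.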

The physical characteristics of each disk include the number of blocks per track and the
rate at which the disk spins. The allocation policy routines use this information to calculate the
number of milliseconds required to skip over a block. The characteristics of the processor include
the expected time to schedule an interrupt. Given the previous block allocated to a file, the
allocation routines calculate the number of blocks to skip over so that the next block in a file will be
coming into position under the disk head in the expected amount of time that it takes to start a
new disk transfer operation. For programs that sequentially access large amounts of data, this
strategy minimises the amount of time spent waiting for the disk to position itself.
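
The calculation described above can be illustrated with a short sketch. It assumes that the scheduling delay is simply rounded up to a whole number of blocks; the function names and the example figures (a 3600 revolution per minute drive with four blocks per track) are hypothetical and are not taken from the measurements in this report.

```python
import math


def ms_per_block(rpm, blocks_per_track):
    """Milliseconds the disk takes to rotate past one block."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / blocks_per_track


def blocks_to_skip(rpm, blocks_per_track, sched_delay_ms):
    """Number of blocks to skip so that the next block of the file arrives under
    the head no earlier than sched_delay_ms after the previous transfer ends."""
    return math.ceil(sched_delay_ms / ms_per_block(rpm, blocks_per_track))
```

With these assumed figures a block passes under the head in about 4.2 milliseconds, so a processor that needs 2 milliseconds to schedule the next transfer skips one block, and one that needs 5 milliseconds skips two.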

To  ease  the  calculation  of  finding  rotationally  optimal  blocks,  the  cylinder  group  summary 
information  includes  a  count  of  the  availability  of  blocks  at  different  rotational  positions.  Eight 
rotational positions are distinguished, so the resolution of the summary information is 2
milliseconds for a typical 3600 revolution per minute drive.

The parameter that defines the minimum number of milliseconds between the completion of
a data transfer and the initiation of another data transfer on the same cylinder can be changed at
any time, even when the file system is mounted and active. If a file system is parameterised to
lay out blocks with rotational separation of 2 milliseconds, and the disk pack is then moved to a
system that has a processor requiring 4 milliseconds to schedule a disk operation, the throughput
will drop precipitously because of lost disk revolutions on nearly every block. If the eventual
target machine is known, the file system can be parameterised for it even though it is initially
created on a different processor. Even if the move is not known in advance, the rotational layout
delay can be reconfigured after the disk is moved so that all further allocation is done based on
the characteristics of the new host.

3.3.  Layout policies

The  file  system  policies  are  divided  into  two  distinct  parts.  At  the  top  level  are  global  poli¬ 
cies  that  use  file  system  wide  summary  information  to  make  decisions  regarding  the  placement  of 
new  inodes  and  data  blocks.  These  routines  are  responsible  for  deciding  the  placement  of  new 
directories  and  files.  They  also  calculate  rotationally  optimal  block  layouts,  and  decide  when  to 
force  a  long  seek  to  a  new  cylinder  group  because  there  are  insufficient  blocks  left  in  the  current 
cylinder  group  to  do  reasonable  layouts.  Below  the  global  policy  routines  are  the  local  allocation 
routines that use a locally optimal scheme to lay out data blocks.

Two  methods  for  improving  file  system  performance  are  to  increase  the  locality  of  reference 
to  minimise  seek  latency  as  described  by  [Trivedi80],  and  to  improve  the  layout  of  data  to  make 
larger transfers possible as described by [Nevalainen77]. The global layout policies try to improve
performance by clustering related information. They cannot attempt to localise all data
references, but must also try to spread unrelated data among different cylinder groups. If too much
localisation  is  attempted,  the  local  cylinder  group  may  run  out  of  space  forcing  the  data  to  be 
scattered  to  non-local  cylinder  groups.  Taken  to  an  extreme,  total  localisation  can  result  in  a  sin¬ 
gle  huge  cluster  of  data  resembling  the  old  file  system.  The  global  policies  try  to  balance  the  two 
conflicting  goals  of  localising  data  that  is  concurrently  accessed  while  spreading  out  unrelated 
data. 

One allocatable resource is inodes. Inodes are used to describe both files and directories.
Files in a directory are frequently accessed together. For example the "list directory" command
often accesses the inode for each file in a directory. The layout policy tries to place all the files in
a directory in the same cylinder group. To ensure that files are allocated throughout the disk, a
different policy is used for directory allocation. A new directory is placed in the cylinder group
that has a greater than average number of free inodes, and the fewest number of directories in it
already. The intent of this policy is to allow the file clustering policy to succeed most of the time.
The allocation of inodes within a cylinder group is done using a next free strategy. Although this
allocates the inodes randomly within a cylinder group, all the inodes for each cylinder group can
be read with 4 to 8 disk transfers. This puts a small and constant upper bound on the number of
disk transfers required to access all the inodes for all the files in a directory as compared to the
old file system where typically, one disk transfer is needed to get the inode for each file in a
directory.
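
The directory placement rule can be sketched as below. The record type, its field names, and the fallback used when no group is strictly above average are assumptions made for illustration; only the selection rule itself comes from the text.

```python
from dataclasses import dataclass


@dataclass
class CylinderGroup:
    index: int
    free_inodes: int
    ndirs: int          # directories already placed in this group


def pick_group_for_directory(groups):
    """Choose the group with a greater than average number of free inodes
    that currently holds the fewest directories."""
    average = sum(g.free_inodes for g in groups) / len(groups)
    candidates = [g for g in groups if g.free_inodes > average]
    if not candidates:
        # Every group has exactly the average; consider them all (assumed fallback).
        candidates = list(groups)
    return min(candidates, key=lambda g: g.ndirs).index
```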

The other major resource is the data blocks. Since data blocks for a file are typically
accessed together, the policy routines try to place all the data blocks for a file in the same
cylinder group, preferably rotationally optimally on the same cylinder. The problem with
allocating all the data blocks in the same cylinder group is that large files will quickly use up available
space in the cylinder group, forcing a spill over to other areas. Using up all the space in a
cylinder group has the added drawback that future allocations for any file in the cylinder group
will also spill to other areas. Ideally none of the cylinder groups should ever become completely
full. The solution devised is to redirect block allocation to a newly chosen cylinder group when a
file exceeds 32 kilobytes and at every megabyte thereafter. The newly chosen cylinder group is
selected from those cylinder groups that have a greater than average number of free blocks left.
Although big files tend to be spread out over the disk, a megabyte of data is typically accessible
before a long seek must be performed, and the cost of one long seek per megabyte is small.
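
As a rough illustration of how rarely a long seek is forced, the sketch below counts the redirections for a file of a given size. It assumes one reading of the rule, namely that the switch points fall at 32 kilobytes and then at each further megabyte beyond that point; the exact offsets used by the implementation are not stated here, and the function name is hypothetical.

```python
KB = 1024
MB = 1024 * KB
FIRST_SWITCH = 32 * KB   # first redirection once the file exceeds this size


def group_switches(file_size):
    """Number of times block allocation is redirected to a new cylinder
    group while a file grows to file_size bytes."""
    if file_size <= FIRST_SWITCH:
        return 0
    return 1 + (file_size - FIRST_SWITCH - 1) // MB
```

Under this reading an eight megabyte file is redirected eight times, or about one long seek per megabyte read sequentially.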

The global policy routines call local allocation routines with requests for specific blocks.
The local allocation routines will always allocate the requested block if it is free. If the requested
block is not available, the allocator allocates a free block of the requested size that is rotationally
closest  to  the  requested  block.  If  the  global  layout  policies  had  complete  information,  they  could 
always  request  unused  blocks  and  the  allocation  routines  would  be  reduced  to  simple  bookkeeping. 
However,  maintaining  complete  information  is  costly;  thus  the  implementation  of  the  global  lay¬ 
out  policy  uses  heuristic  guesses  based  on  partial  information. 

If  a  requested  block  is  not  available  the  local  allocator  uses  a  four  level  allocation  strategy: 

1)  Use  the  available  block  rotationally  closest  to  the  requested  block  on  the  same  cylinder. 

2)  If  there  are  no  blocks  available  on  the  same  cylinder,  use  a  block  within  the  same  cylinder 
group. 

3)  If  the  cylinder  group  is  entirely  full,  quadratically  rehash  among  the  cylinder  groups  looking 
for  a  free  block. 

4)  Finally  if  the  rehash  fails,  apply  an  exhaustive  search. 

The  use  of  quadratic  rehash  is  prompted  by  studies  of  symbol  table  strategies  used  in  pro¬ 
gramming  languages.  File  systems  that  are  parameterised  to  maintain  at  least  10%  free  space 
almost  never  use  this  strategy;  file  systems  that  are  run  without  maintaining  any  free  space  typi¬ 
cally  have  so  few  free  blocks  that  almost  any  allocation  is  random.  Consequently  the  most  impor¬ 
tant  characteristic  of  the  strategy  used  when  the  file  system  is  low  on  space  is  that  it  be  fast. 
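
The last two levels of the strategy, the rehash among cylinder groups followed by an exhaustive search, can be sketched as follows. The probe sequence shown, in which the stride doubles on each step, is one simple way to realise such a rehash and is an assumption; the function name and the representation of free space as a list of counts are also hypothetical.

```python
def find_group_with_space(start, free_blocks):
    """Return the index of a cylinder group with a free block, or None.

    free_blocks[i] is the number of free blocks in cylinder group i and
    start is the preferred group.
    """
    n = len(free_blocks)
    # Levels 1 and 2: the preferred cylinder group itself.
    if free_blocks[start] > 0:
        return start
    # Level 3: rehash among the groups, doubling the stride each time.
    cg, step = start, 1
    while step < n:
        cg = (cg + step) % n
        if free_blocks[cg] > 0:
            return cg
        step *= 2
    # Level 4: exhaustive search of every other group.
    for offset in range(1, n):
        cg = (start + offset) % n
        if free_blocks[cg] > 0:
            return cg
    return None
```

The rehash visits only a handful of groups, which keeps the common failure path fast; the exhaustive pass guarantees that any remaining free block is eventually found.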


4.  Performance 

Ultimately, the proof of the effectiveness of the algorithms described in the previous section
is  the  long  term  performance  of  the  new  file  system. 

Our empirical studies have shown that the inode layout policy has been effective. When
running the "list directory" command on a large directory that itself contains many directories, the
number of disk accesses for inodes is cut by a factor of two. The improvements are even more
dramatic for large directories containing only files, disk accesses for inodes being cut by a factor
of eight. This is most encouraging for programs such as spooling daemons that access many small
files, since these programs tend to flood the disk request queue on the old file system.

Table 2 summarises the measured throughput of the new file system. Several comments
need to be made about the conditions under which these tests were run. The test programs
measure the rate that user programs can transfer data to or from a file without performing any
processing on it. These programs must write enough data to ensure that buffering in the operating
system does not affect the results. They should also be run at least three times in succession; the
first to get the system into a known state and the second two to ensure that the experiment has
stabilised and is repeatable. The methodology and test results are discussed in detail in
[Kridle83]†. The systems were running multi-user but were otherwise quiescent. There was no
contention for either the cpu or the disk arm. The only difference between the UNIBUS and
MASSBUS tests was the controller. All tests used an Ampex Capricorn 330 Megabyte Winchester
disk. As Table 2 shows, all file system test runs were on a VAX 11/750. All file systems had
been in production use for at least a month before being measured.


Type of           Processor and                       Read
File System       Bus Measured     Speed              Bandwidth         % CPU

old 1024          750/UNIBUS        29 Kbytes/sec      29/1100   3%     11%
new 4096/1024     750/UNIBUS       221 Kbytes/sec     221/1100  20%     43%
new 8192/1024     750/UNIBUS       233 Kbytes/sec     233/1100  21%     29%
new 4096/1024     750/MASSBUS      466 Kbytes/sec     466/1200  39%     73%
new 8192/1024     750/MASSBUS      466 Kbytes/sec     466/1200  39%     54%

Table 2a - Reading rates of the old and new UNIX file systems.


Type of           Processor and                       Write
File System       Bus Measured     Speed              Bandwidth         % CPU

old 1024          750/UNIBUS        48 Kbytes/sec      48/1100   4%     29%
new 4096/1024     750/UNIBUS       142 Kbytes/sec     142/1100  13%     43%
new 8192/1024     750/UNIBUS       215 Kbytes/sec     215/1100  19%     46%
new 4096/1024     750/MASSBUS      323 Kbytes/sec     323/1200  27%     94%
new 8192/1024     750/MASSBUS      466 Kbytes/sec     466/1200  39%     95%

Table 2b - Writing rates of the old and new UNIX file systems.


Unlike  the  old  file  system,  the  transfer  rates  for  the  new  file  system  do  not  appear  to  change 
over  time.  The  throughput  rate  is  tied  much  more  strongly  to  the  amount  of  free  space  that  is 
maintained.  The  measurements  in  Table  2  were  based  on  a  file  system  run  with  10%  free  space. 
Synthetic  work  loads  suggest  the  performance  deteriorates  to  about  half  the  throughput  rates 
given  in  Table  2  when  no  free  space  is  maintained. 

The percentage of bandwidth given in Table 2 is a measure of the effective utilisation of the
disk by the file system. An upper bound on the transfer rate from the disk is measured by doing
65536* byte reads from contiguous tracks on the disk. The bandwidth is calculated by comparing
the data rates the file system is able to achieve as a percentage of this rate. Using this metric,
the old file system is only able to use about 3-4% of the disk bandwidth, while the new file system
uses up to 39% of the bandwidth.

In the new file system, the reading rate is always at least as fast as the writing rate. This is
to be expected since the kernel must do more work when allocating blocks than when simply
reading them. Note that the write rates are about the same as the read rates in the 8192 byte
block file system; the write rates are slower than the read rates in the 4096 byte block file system.
The slower write rates occur because the kernel has to do twice as many disk allocations per
second, and the processor is unable to keep up with the disk transfer rate.

In contrast the old file system is about 50% faster at writing files than reading them. This
is because the write system call is asynchronous and the kernel can generate disk transfer requests
much faster than they can be serviced, hence disk transfers build up in the disk buffer cache.
Because the disk buffer cache is sorted by minimum seek order, the average seek between the
scheduled disk writes is much less than it would be if the data blocks were written out in the
order in which they are generated. However when the file is read, the read system call is
processed synchronously so the disk blocks must be retrieved from the disk in the order in which
they are allocated. This forces the disk scheduler to do long seeks resulting in a lower throughput
rate.

† A UNIX command that is similar to the reading test that we used is "cp file /dev/null", where "file" is eight
megabytes long.

* This number, 65536, is the maximal I/O size supported by the VAX hardware; it is a remnant of the system's
PDP-11 ancestry.

The performance of the new file system is currently limited by a memory to memory copy
operation because it transfers data from the disk into buffers in the kernel address space and then
spends 40% of the processor cycles copying these buffers to user address space. If the buffers in
both address spaces are properly aligned, this transfer can be effected without copying by using
the VAX virtual memory management hardware. This is especially desirable when large amounts
of data are to be transferred. We did not implement this because it would change the semantics
of the file system in two major ways: user programs would be required to allocate buffers on page
boundaries, and data would disappear from buffers after being written.

Greater disk throughput could be achieved by rewriting the disk drivers to chain together
kernel buffers. This would allow files to be allocated to contiguous disk blocks that could be read
in a single disk transaction. Most disks contain either 32 or 48 512 byte sectors per track. The
inability to use contiguous disk blocks effectively limits the performance on these disks to less
than fifty percent of the available bandwidth. Since each track has a multiple of sixteen sectors it
holds exactly two or three 8192 byte file system blocks, or four or six 4096 byte file system blocks.
If the next block for a file cannot be laid out contiguously, then the minimum spacing to the
next allocatable block on any platter is between a sixth and a half of a revolution. The implication
of this is that the best possible layout without contiguous blocks uses only half of the bandwidth
of any given track. If each track contains an odd number of sectors, then it is possible to resolve
the rotational delay to any number of sectors by finding a block that begins at the desired
rotational position on another track. Block chaining has not been implemented
because it would require rewriting all the disk drivers in the system, and the current throughput
rates are already limited by the speed of the available processors.

Currently only one block is allocated to a file at a time. A technique used by the DEMOS
file system when it finds that a file is growing rapidly is to preallocate several blocks at once,
releasing them when the file is closed if they remain unused. By batching up the allocation the
system can reduce the overhead of allocating at each write, and it can cut down on the number of
disk writes needed to keep the block pointers on the disk synchronised with the block allocation
[Powell79].


5.  File system functional enhancements

The speed enhancements to the UNIX file system did not require any changes to the
semantics or data structures viewed by the users. However several changes have been generally desired
for some time but have not been introduced because they would require users to dump and restore
all their file systems. Since the new file system already requires that all existing file systems be
dumped and restored, these functional enhancements have been introduced at this time.

5.1.  Long file names

File names can now be of nearly arbitrary length. The only user programs affected by this
change are those that access directories. To maintain portability among UNIX systems that are
not running the new file system, a set of directory access routines has been introduced that
provides a uniform interface to directories on both old and new systems.

Directories are allocated in units of 512 bytes. This size is chosen so that each allocation
can be transferred to disk in a single atomic operation. Each allocation unit contains variable
length directory entries. Each entry is wholly contained in a single allocation unit. The first
three fields of a directory entry are fixed and contain an inode number, the length of the entry,
and the length of the name contained in the entry. Following this fixed size information is the
null terminated name, padded to a 4 byte boundary. The maximum length of a name in a
directory is currently 255 characters.

Free space in a directory is held by entries that have a record length that exceeds the space
required by the directory entry itself. All the bytes in a directory unit are claimed by the
directory entries. This normally results in the last entry in a directory being large. When entries are
deleted from a directory, the space is returned to the previous entry in the same directory unit by
increasing its length. If the first entry of a directory unit is free, then its inode number is set to
zero to show that it is unallocated.
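
The entry layout described above can be modelled as follows. The field widths (a 4 byte inode number followed by two 2 byte lengths) and the little-endian byte order are assumptions chosen for illustration, as are the function names; the text fixes only the order of the fields, the 4 byte padding, and the 255 character limit.

```python
import struct

# inode number, entry length, name length (widths and byte order assumed)
HEADER = struct.Struct("<IHH")
MAX_NAME = 255


def min_entry_size(name):
    """Smallest entry that can hold name: header plus the null terminated
    name padded to a 4 byte boundary."""
    terminated = len(name.encode()) + 1
    return HEADER.size + ((terminated + 3) & ~3)


def pack_entry(inode, name, entry_length=None):
    """Build one directory entry. An entry_length larger than the minimum
    leaves free space at the end of the entry."""
    raw = name.encode()
    if len(raw) > MAX_NAME:
        raise ValueError("name too long")
    needed = min_entry_size(name)
    if entry_length is None:
        entry_length = needed
    if entry_length < needed:
        raise ValueError("entry too small for name")
    padding = b"\0" * (entry_length - HEADER.size - len(raw))
    return HEADER.pack(inode, entry_length, len(raw)) + raw + padding
```

For instance, pack_entry(7, "a", 512) produces a single entry that claims a whole 512 byte directory unit, with everything past the first 12 bytes available as free space.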

5.2.  File locking

The old file system had no provision for locking files. Processes that needed to synchronise
the updates of a file had to create a separate "lock" file to synchronise their updates. A process
would try to create a "lock" file. If the creation succeeded, then it could proceed with its update;
if the creation failed, then it would wait, and try again. This mechanism had three drawbacks.
Processes consumed CPU time by looping over attempts to create locks. Locks were left lying
around following system crashes and had to be cleaned up by hand. Finally, processes running as
system administrator are always permitted to create files, so they had to use a different
mechanism. While it is possible to get around all these problems, the solutions are not straightforward,
so a mechanism for locking files has been added.

The most general schemes allow processes to concurrently update a file. Several of these
techniques are discussed in [Peterson83]. A simpler technique is to simply serialise access with
locks. To attain reasonable efficiency, certain applications require the ability to lock pieces of a
file. Locking down to the byte level has been implemented in the Onyx file system by [Bass81].
However, for the applications that currently run on the system, a mechanism that locks at the
granularity of a file is sufficient.

Locking schemes fall into two classes, those using hard locks and those using advisory locks.
The primary difference between advisory locks and hard locks is the decision of when to override
them. A hard lock is always enforced whenever a program tries to access a file; an advisory lock is
only applied when it is requested by a program. Thus advisory locks are only effective when all
programs accessing a file use the locking scheme. With hard locks there must be some override
policy implemented in the kernel; with advisory locks the policy is implemented by the user
programs. In the UNIX system, programs with system administrator privilege can override any
protection scheme. Because many of the programs that need to use locks run as system
administrators, we chose to implement advisory locks rather than create a protection scheme that was
contrary to the UNIX philosophy or could not be used by system administration programs.

The file locking facilities allow cooperating programs to apply advisory shared or exclusive
locks on files. Only one process may hold an exclusive lock on a file, while multiple shared locks may be
present. Both shared and exclusive locks cannot be present on a file at the same time. If any
lock is requested when another process holds an exclusive lock, or an exclusive lock is requested
when another process holds any lock, the open will block until the lock can be gained. Because
shared and exclusive locks are advisory only, even if a process has obtained a lock on a file,
another process can override the lock by opening the same file without a lock.

Locks  can  be  applied  or  removed  on  open  files,  so  that  locks  can  be  manipulated  without 
needing  to  close  and  reopen  the  file.  This  is  useful,  for  example,  when  a  process  wishes  to  open  a 
file with a shared lock to read some information, to determine whether an update is required. It
can  then  get  an  exclusive  lock  so  that  it  can  do  a  read,  modify,  and  write  to  update  the  file  in  a 
consistent  manner. 

A request for a lock will cause the process to block if the lock cannot be immediately
obtained. In certain instances this is unsatisfactory. For example, a process that wants only to
check if a lock is present would require a separate mechanism to find out this information.
Consequently, a process may specify that its locking request should return with an error if a lock
cannot be immediately obtained. Being able to poll for a lock is useful to "daemon" processes that
wish to service a spooling area. If the first instance of the daemon locks the directory where
spooling takes place, later daemon processes can easily check to see if an active daemon exists.
Since the lock is removed when the process exits or the system crashes, there is no problem with
unintentional lock files that must be cleared by hand.

Almost no deadlock detection is attempted. The only deadlock detection made by the
system is that the file descriptor to which a lock is applied does not currently have a lock of the
same type (i.e. the second of two successive calls to apply a lock of the same type will fail). Thus
a process can deadlock itself by requesting locks on two separate file descriptors for the same
object.
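
On present-day UNIX-like systems, advisory whole-file locks of this kind are available through the flock call, which Python exposes as fcntl.flock. The sketch below assumes that interface behaves as described in this section; the helper names are hypothetical. It uses the non-blocking form, so a conflicting request reports failure instead of blocking.

```python
import fcntl


def try_lock(fileobj, exclusive=True):
    """Try to apply an advisory lock without blocking.

    Returns True if the lock was obtained, False if a conflicting lock is held
    through another open of the same file.
    """
    operation = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(fileobj, operation | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


def unlock(fileobj):
    """Remove any lock held through this open file."""
    fcntl.flock(fileobj, fcntl.LOCK_UN)
```

A spooling daemon could call try_lock on its spool directory or a file within it at startup and exit if the call returns False, since that indicates an active daemon already holds the lock. Requesting a blocking exclusive lock through a second descriptor for a file the same process has already locked would produce the self-deadlock noted above.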

5.3.  Symbolic links

The 512 byte UNIX file system allows multiple directory entries in the same file system to
reference a single file. The link concept is fundamental; files do not live in directories, but exist
separately and are referenced by links. When all the links are removed, the file is deallocated.
This style of links does not allow references across physical file systems, nor does it support
inter-machine linkage. To avoid these limitations symbolic links have been added similar to the scheme
used by Multics [Feiertag71].

A symbolic link is implemented as a file that contains a pathname. When the system
encounters a symbolic link while interpreting a component of a pathname, the contents of the
symbolic link are prepended to the rest of the pathname, and this name is interpreted to yield the
resulting pathname. If the symbolic link contains an absolute pathname, the absolute pathname
is used; otherwise the contents of the symbolic link are evaluated relative to the location of the link
in the file hierarchy.
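
The interpretation rule can be modelled on plain strings, without touching a real file system. In the sketch below the table links, which maps the absolute path of each symbolic link to the pathname stored in it, is a stand-in for the file system, and the limit on the number of links followed is an assumption added to stop on cycles; neither is taken from the implementation.

```python
import posixpath


def resolve(path, links, max_links=8):
    """Interpret an absolute path, expanding symbolic links found in links."""
    todo = [c for c in path.split("/") if c]   # components still to interpret
    done = "/"                                 # directory reached so far
    followed = 0
    while todo:
        component = todo.pop(0)
        here = posixpath.join(done, component)
        if here in links:
            followed += 1
            if followed > max_links:
                raise OSError("too many levels of symbolic links")
            target = links[here]
            if target.startswith("/"):
                done = "/"                     # absolute: restart at the root
            # relative: continue from the directory that holds the link
            todo = [c for c in target.split("/") if c] + todo
        elif component == "..":
            done = posixpath.dirname(done)
        elif component != ".":
            done = here
    return done
```

With links = {"/usr/tmp": "/var/tmp", "/home/me/cur": "proj/v2"}, the path /usr/tmp/x resolves to /var/tmp/x and /home/me/cur/src resolves to /home/me/proj/v2/src.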

Normally  programs  do  not  want  to  be  aware  that  there  is  a  symbolic  link  in  a  pathname 
that  they  are  using.  However  certain  system  utilities  must  be  able  to  detect  and  manipulate  sym¬ 
bolic  links.  Three  new  system  calls  provide  the  ability  to  detect,  read,  and  write  symbolic  links, 
and  seven  system  utilities  were  modified  to  use  these  calls. 

In  future  Berkeley  software  distributions  it  will  be  possible  to  mount  file  systems  from  other 
machines  within  a  local  file  system.  When  this  occurs,  it  will  be  possible  to  create  symbolic  links 
that  span  machines. 

5.4.  Rename 

Programs that create new versions of data files typically create the new version as a
temporary file and then rename the temporary file with the original name of the data file. In the old
UNIX file systems the renaming required three calls to the system. If the program were
interrupted or the system crashed between these calls, the data file could be left with only its
temporary name. To eliminate this possibility a single system call has been added that performs the
rename in an atomic fashion to guarantee the existence of the original name.
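
A minimal sketch of the pattern this paragraph describes, using the C library rename() function, which on POSIX systems replaces an existing target in a single step. The helper name replace_file and its arguments are hypothetical, and the error handling is deliberately simple.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Replace the file `path` with new contents by writing a temporary
 * file and renaming it over the original. Returns 0 on success.
 */
int
replace_file(const char *path, const char *tmp, const char *data)
{
	FILE *fp;

	assert(path != NULL && tmp != NULL && data != NULL);
	if ((fp = fopen(tmp, "w")) == NULL)
		return -1;
	if (fputs(data, fp) == EOF) {
		fclose(fp);
		remove(tmp);
		return -1;
	}
	if (fclose(fp) == EOF) {
		remove(tmp);
		return -1;
	}
	/* One call: `path` names either the old version or the new one. */
	if (rename(tmp, path) != 0) {
		remove(tmp);
		return -1;
	}
	return 0;
}
```

If the program is interrupted before the rename, only the temporary file is affected and the original name still refers to the old data.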

In  addition,  the  rename  facility  allows  directories  to  be  moved  around  in  the  directory  tree 
hierarchy. The rename system call performs special validation checks to ensure that the directory
tree structure is not corrupted by the creation of loops or inaccessible directories. Such
corruption would occur if a parent directory were moved into one of its descendants. The validation
check requires tracing the ancestry of the target directory to ensure that it does not include the
directory being moved.
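
The ancestry trace can be sketched over a toy directory tree held as a parent array. The representation, the ROOT constant, and the name rename_is_safe are assumptions made for illustration; the kernel walks ".." entries on disk instead.

```c
#include <assert.h>
#include <stddef.h>

#define ROOT 0		/* hypothetical: directory 0 is the root */

/*
 * parent[d] is the parent of directory d; parent[ROOT] == ROOT.
 * Returns 1 if directory `src` may be moved under `target`, 0 if
 * `src` is `target` itself or one of its ancestors.
 */
int
rename_is_safe(const int *parent, int src, int target)
{
	int d = target;

	assert(parent != NULL);
	for (;;) {
		if (d == src)
			return 0;	/* src lies on target's ancestry */
		if (d == ROOT)
			return 1;	/* reached the root without meeting src */
		d = parent[d];
	}
}
```

With the tree root/1/2 and root/3, moving directory 1 under directory 2 is rejected, while moving it under directory 3 is allowed.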

5.5. Quotas

The UNIX system has traditionally attempted to share all available resources to the greatest
extent possible. Thus any single user can allocate all the available space in the file system. In
certain  environments  this  is  unacceptable.  Consequently,  a  quota  mechanism  has  been  added  for 
restricting  the  amount  of  file  system  resources  that  a  user  can  obtain.  The  quota  mechanism  sets 
limits  on  both  the  number  of  files  and  the  number  of  disk  blocks  that  a  user  may  allocate.  A 
separate quota can be set for each user on each file system. Each resource is given both a hard

Appendix C - Networking Implementation

1. Introduction

This report describes the internal structure of facilities added to the 4.2BSD version of the
UNIX operating system for the VAX. The system facilities provide a uniform user interface to
networking within UNIX. In addition, the implementation introduces a structure for network
communications which may be used by system implementors in adding new networking facilities.
The internal structure is not visible to the user, rather it is intended to aid implementors of com¬
munication  protocols  and  network  services  by  providing  a  framework  which  promotes  code  shar¬ 
ing  and  minimises  implementation  effort. 

The  reader  is  expected  to  be  familiar  with  the  C  programming  language  and  system  inter¬ 
face, as described in the body of this report. Basic understanding of network communication con¬
cepts  is  assumed;  where  required  any  additional  ideas  are  introduced. 

The  remainder  of  this  document  provides  a  description  of  the  system  internals,  avoiding, 
when  possible,  those  portions  which  are  utilised  only  by  the  interprocess  communication  facilities. 


2.  Overview 

If we consider the International Standards Organisation's (ISO) Open System
Interconnection (OSI) model of network communication [ISO81] [Zimmermann80], the networking facilities
described here correspond to a portion of the session layer (layer 3) and all of the transport and
network layers (layers 2 and 1, respectively).

The network layer provides possibly imperfect data transport services with minimal
addressing structure. Addressing at this level is normally host to host, with implicit or explicit routing
optionally supported by the communicating agents.

At the transport layer the notions of reliable transfer, data sequencing, flow control, and
service addressing are normally included. Reliability is usually managed by explicit
acknowledgement of data delivered. Failure to acknowledge a transfer results in retransmission of the data.
Sequencing may be handled by tagging each message handed to the network layer by a sequence
number and maintaining state at the endpoints of communication to utilise received sequence
numbers in reordering data which arrives out of order.

The session layer facilities may provide forms of addressing which are mapped into formats
required by the transport layer, service authentication and client authentication, etc. Various sys¬
tems  provide  services  such  as  data  encryption  and  address  and  protocol  translation. 

The  following  sections  begin  by  describing  some  of  the  common  data  structures  and  utility 
routines,  then  examine  the  internal  layering.  The  contents  of  each  layer  and  its  interface  are  con¬ 
sidered. Certain of the interfaces are protocol implementation specific. For these cases examples
have been drawn from the Internet [Cerf78] protocol family. Later sections cover routing issues,
the  design  of  the  raw  socket  interface  and  other  miscellaneous  topics. 


3. Goals

The networking system was designed with the goal of supporting multiple protocol families
and addressing styles. This required information to be "hidden" in common data structures
which could be manipulated by all the pieces of the system, but which required interpretation
only by the protocols which "controlled" it. The system described here attempts to minimise the
use of shared data structures to those kept by a suite of protocols (a protocol family), and those
used for rendezvous between "synchronous" and "asynchronous" portions of the system (e.g.
queues of data packets are filled at interrupt time and emptied based on user requests).

A major goal of the system was to provide a framework within which new protocols and
hardware could easily be supported. To this end, a great deal of effort has been expended to
create utility routines which hide many of the more complex and/or hardware dependent chores
of networking. Later sections describe the utility routines and the underlying data structures they
manipulate.


4.  Internal  address  representation 

Common to all portions of the system are two data structures. These structures are used to
represent addresses and various data objects. Addresses, internally, are described by the sockaddr
structure,

	struct sockaddr {
		short	sa_family;	/* data format identifier */
		char	sa_data[14];	/* address */
	};

All addresses belong to one or more address families which define their format and interpretation.
The sa_family field indicates which address family the address belongs to, the sa_data field con¬
tains the actual data value. The size of the data field, 14 bytes, was selected based on a study of
current address formats*.
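
As an illustration of how a family-tagged address might be filled in and read back, the sketch below copies a family-specific address into the generic 16 byte container. The family identifier AF_SKETCH_INET, the sketch_inet_addr layout, and both helper names are hypothetical and are not taken from the system sources.

```c
#include <assert.h>
#include <string.h>

#define AF_SKETCH_INET 2	/* hypothetical family identifier */

struct gen_sockaddr {		/* same layout as struct sockaddr above */
	short	sa_family;	/* data format identifier */
	char	sa_data[14];	/* address, interpreted per family */
};

/* Hypothetical family-specific view: 2 byte port, 4 byte host. */
struct sketch_inet_addr {
	unsigned char port[2];
	unsigned char host[4];
};

/* Store a family-specific address in the generic container. */
void
sketch_inet_pack(struct gen_sockaddr *sa, const struct sketch_inet_addr *in)
{
	assert(sizeof(*in) <= sizeof(sa->sa_data));
	memset(sa, 0, sizeof(*sa));
	sa->sa_family = AF_SKETCH_INET;
	memcpy(sa->sa_data, in, sizeof(*in));
}

/* Only code that understands the family looks inside sa_data. */
int
sketch_inet_unpack(const struct gen_sockaddr *sa, struct sketch_inet_addr *in)
{
	if (sa->sa_family != AF_SKETCH_INET)
		return -1;
	memcpy(in, sa->sa_data, sizeof(*in));
	return 0;
}
```

Code that does not recognise the family can still copy, queue, and compare the 16 bytes without interpreting them, which is the kind of information hiding the design goals call for.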


5. Memory management

A single mechanism is used for data storage, memory buffers, or mbufs. An mbuf is a struc¬
ture of the form:

	struct mbuf {
		struct	mbuf *m_next;		/* next buffer in chain */
		u_long	m_off;			/* offset of data */
		short	m_len;			/* amount of data in this mbuf */
		short	m_type;			/* mbuf type (accounting) */
		u_char	m_dat[MLEN];		/* data storage */
		struct	mbuf *m_act;		/* link in higher-level mbuf list */
	};

The m_next field is used to chain mbufs together on linked lists, while the m_act field allows
lists of mbufs to be accumulated. By convention, the mbufs common to a single object (for example, a
packet) are chained together with the m_next field, while groups of objects are linked via the
m_act field (possibly when in a queue).

Each mbuf has a small data area for storing information, m_dat. The m_len field indicates
the amount of data, while the m_off field is an offset to the beginning of the data from the base of
the mbuf. Thus, for example, the macro mtod, which converts a pointer to an mbuf to a pointer
to the data stored in the mbuf, has the form

	#define	mtod(x,t)	((t)((int)(x) + (x)->m_off))

(note the t parameter, a C type cast, is used to cast the resultant pointer for proper assignment).
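
The fields described so far can be exercised with a simplified stand-in for the mbuf structure. All sk_ names below are hypothetical. The sketch replaces the (int) cast of the original macro with char pointer arithmetic so that it also behaves on machines where pointers are wider than int, and it does not try to reproduce the 128 byte size of a real mbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_MLEN 112			/* data bytes per mbuf, per the text */

struct sk_mbuf {			/* simplified stand-in for struct mbuf */
	struct sk_mbuf *m_next;		/* next buffer in chain */
	unsigned long m_off;		/* offset of data from mbuf base */
	short	m_len;			/* amount of data in this mbuf */
	short	m_type;			/* mbuf type (unused here) */
	unsigned char m_dat[SK_MLEN];	/* data storage */
	struct sk_mbuf *m_act;		/* next object in a list of chains */
};

/* Portable spelling of mtod: arithmetic on char *, not on int. */
#define sk_mtod(x, t)	((t)((char *)(x) + (x)->m_off))

/* Fill an mbuf with len bytes placed at the start of its data area. */
void
sk_fill(struct sk_mbuf *m, const void *src, short len)
{
	assert(len >= 0 && len <= SK_MLEN);
	memset(m, 0, sizeof(*m));
	m->m_off = offsetof(struct sk_mbuf, m_dat);
	m->m_len = len;
	memcpy(sk_mtod(m, char *), src, (size_t)len);
}

/* Total bytes in one object: follow m_next, never m_act. */
int
sk_chain_len(const struct sk_mbuf *m)
{
	int total = 0;

	for (; m != NULL; m = m->m_next)
		total += m->m_len;
	return total;
}
```

Two mbufs linked through m_next form one packet, while m_act on the first mbuf of that packet would point at the first mbuf of the next packet in a queue.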

In addition to storing data directly in the mbuf's data area, data of page size may also be
stored in a separate area of memory. The mbuf utility routines maintain a pool of pages for this
purpose and manipulate a private page map for such pages. The virtual addresses of these data
pages precede those of mbufs, so when pages of data are separated from an mbuf, the mbuf data
offset is a negative value. An array of reference counts on pages is also maintained so that copies
of pages may be made without core to core copying (copies are created simply by duplicating the
relevant page table entries in the data page map and incrementing the associated reference counts
for the pages). Separate data pages are currently used only when copying data from a user
process into the kernel, and when bringing data in at the hardware level. Routines which manipulate
mbufs are not normally aware if data is stored directly in the mbuf data array, or if it is kept in
separate pages.

* Later versions of the system support variable length addresses.
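
The reference counting idea can be sketched without any page tables: a "copy" of a page only bumps a counter, and the page returns to the pool when the last reference is dropped. The pool size and all sk_ names are hypothetical, and the real routines also duplicate page table entries, which this sketch leaves out.

```c
#include <assert.h>

#define SK_NPAGES 8			/* hypothetical pool size */

static unsigned char sk_refcnt[SK_NPAGES];	/* one count per data page */

/* Claim a free page; returns its index, or -1 if the pool is exhausted. */
int
sk_page_alloc(void)
{
	int i;

	for (i = 0; i < SK_NPAGES; i++)
		if (sk_refcnt[i] == 0) {
			sk_refcnt[i] = 1;
			return i;
		}
	return -1;
}

/* "Copy" a page by sharing it: no bytes move, only the count changes. */
int
sk_page_copy(int pg)
{
	assert(pg >= 0 && pg < SK_NPAGES && sk_refcnt[pg] > 0);
	sk_refcnt[pg]++;
	return pg;
}

/* Drop one reference; returns 1 when the page became free. */
int
sk_page_free(int pg)
{
	assert(pg >= 0 && pg < SK_NPAGES && sk_refcnt[pg] > 0);
	return --sk_refcnt[pg] == 0;
}
```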

The following utility routines are available for manipulating mbuf chains:

m = m_copy(m0, off, len);

The m_copy routine creates a copy of all, or part, of a list of the mbufs in m0. Len bytes of
data, starting off bytes from the front of the chain, are copied. Where possible, reference
counts on pages are used instead of core to core copies. The original mbuf chain must have
at least off + len bytes of data. If len is specified as M_COPYALL, all the data present,
offset as before, is copied.

m_cat(m,  n); 

The mbuf chain, n, is appended to the end of m. Where possible, compaction is performed.

m_adj(m, diff);

The mbuf chain, m, is adjusted in size by diff bytes. If diff is non-negative, diff bytes are
shaved off the front of the mbuf chain. If diff is negative, the alteration is performed from
back to front. No space is reclaimed in this operation, alterations are accomplished by
changing the m_len and m_off fields of mbufs.
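
A minimal sketch of the adjustment just described, written against a cut-down mbuf that keeps only the three fields involved. It is not the system's m_adj; the name sk_adj and the choice to clamp an oversized negative adjustment to an empty chain are assumptions.

```c
#include <assert.h>
#include <stddef.h>

struct sk_mbuf {			/* only the fields m_adj touches */
	struct sk_mbuf *m_next;
	unsigned long m_off;
	short	m_len;
};

/* Trim diff bytes from the front (diff >= 0) or -diff from the back. */
void
sk_adj(struct sk_mbuf *mp, int diff)
{
	struct sk_mbuf *m;
	int len, total = 0;

	if (diff >= 0) {
		len = diff;
		for (m = mp; m != NULL && len > 0; m = m->m_next) {
			int take = m->m_len < len ? m->m_len : len;

			m->m_off += (unsigned long)take;  /* data starts later */
			m->m_len -= (short)take;
			len -= take;
		}
		return;
	}
	for (m = mp; m != NULL; m = m->m_next)
		total += m->m_len;
	len = total + diff;		/* bytes to keep, counted from the front */
	if (len < 0)
		len = 0;
	for (m = mp; m != NULL; m = m->m_next) {
		if (m->m_len > len)
			m->m_len = (short)len;	/* m_off is left untouched */
		len -= m->m_len;
	}
}
```

No mbuf is freed in either direction, so an mbuf whose m_len drops to zero simply stays on the chain.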

m = m_pullup(m0, size);

After a successful call to m_pullup, the mbuf at the head of the returned list, m, is
guaranteed to have at least size bytes of data in contiguous memory (allowing access via a
pointer, obtained using the mtod macro). If the original data was less than size bytes long,
size was greater than the size of an mbuf data area (112 bytes), or required resources were
unavailable, m is 0 and the original mbuf chain is deallocated.

This routine is particularly useful when verifying packet header lengths on reception. For
example, if a packet is received and only 8 of the necessary 16 bytes required for a valid
packet header are present at the head of the list of mbufs representing the packet, the
remaining 8 bytes may be "pulled up" with a single m_pullup call. If the call fails the
invalid packet will have been discarded.

By ensuring mbufs always reside on 128 byte boundaries it is possible to always locate the
mbuf associated with a data area by masking off the low bits of the virtual address. This allows
modules to store data structures in mbufs and pass them around without concern for locating the
original mbuf when it comes time to free the structure. The dtom macro is used to convert a
pointer into an mbuf's data area to a pointer to the mbuf,

	#define	dtom(x)	((struct mbuf *)((int)x & ~(MSIZE-1)))
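
The masking trick can be tried on a small statically allocated pool. The sketch assumes a C11 compiler for _Alignas and uses uintptr_t in place of the original (int) cast; the sk_ names and the pool itself are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_MSIZE 128			/* mbuf size and alignment, per the text */

struct sk_mbuf {			/* padded so that sizeof == SK_MSIZE */
	struct sk_mbuf *m_next;
	unsigned char m_dat[SK_MSIZE - sizeof(struct sk_mbuf *)];
};

/* Portable spelling of dtom: mask with uintptr_t rather than int. */
#define sk_dtom(x) \
	((struct sk_mbuf *)((uintptr_t)(x) & ~(uintptr_t)(SK_MSIZE - 1)))

/* A small pool whose elements all start on SK_MSIZE boundaries. */
static _Alignas(SK_MSIZE) struct sk_mbuf sk_pool[4];

/* Return the pool index of the mbuf that owns an interior pointer. */
int
sk_owner(const void *p)
{
	return (int)(sk_dtom(p) - sk_pool);
}
```

The trick only works because every mbuf starts on a multiple of its own size, which is why the alignment guarantee is stated before the macro.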

Mbufs are used for dynamically allocated data structures such as sockets, as well as memory
allocated for packets. Statistics are maintained on mbuf usage and can be viewed by users using
the netstat(1) program.


6. Internal layering

The  internal  structure  of  the  network  system  is  divided  into  three  layers.  These  layers 
correspond  to  the  services  provided  by  the  socket  abstraction,  those  provided  by  the  communica¬ 
tion  protocols,  and  those  provided  by  the  hardware  interfaces.  The  communication  protocols  are 
normally layered into two or more individual cooperating layers, though they are collectively
viewed in the system as one layer providing services supportive of the appropriate socket abstrac¬
tion.

The following sections describe the properties of each layer in the system and the interfaces
each must conform to.


6.1.  Socket  layer 

The  socket  layer  deals  with  the  interprocess  communications  facilities  provided  by  the  sys¬ 
tem. A socket is a bidirectional endpoint of communication which is "typed" by the semantics of
communication it supports. The system calls described in the 4.2BSD System Manual are used to
manipulate sockets.

A  socket  consists  of  the  following  data  structure: 


	struct socket {
		short	so_type;		/* generic type */
		short	so_options;		/* from socket call */
		short	so_linger;		/* time to linger while closing */
		short	so_state;		/* internal state flags */
		caddr_t	so_pcb;			/* protocol control block */
		struct	protosw *so_proto;	/* protocol handle */
		struct	socket *so_head;	/* back pointer to accept socket */
		struct	socket *so_q0;		/* queue of partial connections */
		short	so_q0len;		/* partials on so_q0 */
		struct	socket *so_q;		/* queue of incoming connections */
		short	so_qlen;		/* number of connections on so_q */
		short	so_qlimit;		/* max number queued connections */
		struct	sockbuf so_snd;		/* send queue */
		struct	sockbuf so_rcv;		/* receive queue */
		short	so_timeo;		/* connection timeout */
		u_short	so_error;		/* error affecting connection */
		short	so_oobmark;		/* chars to oob mark */
		short	so_pgrp;		/* pgrp for signals */
	};

Each socket contains two data queues, so_rcv and so_snd, and a pointer to routines which
provide supporting services. The type of the socket, so_type, is defined at socket creation time and
used in selecting those services which are appropriate to support it. The supporting protocol is
selected at socket creation time and recorded in the socket data structure for later use. Protocols
are defined by a table of procedures, the protosw structure, which will be described in detail later.
A pointer to a protocol specific data structure, the "protocol control block", is also present in the
socket structure. Protocols control this data structure and it normally includes a back pointer to
the parent socket structure(s) to allow easy lookup when returning information to a user (for
example, placing an error number in the so_error field). The other entries in the socket structure
are used in queueing connection requests, validating user requests, storing socket characteristics
(e.g. options supplied at the time a socket is created), and maintaining a socket's state.

Processes "rendezvous at a socket" in many instances. For instance, when a process wishes
to extract data from a socket's receive queue and it is empty, or lacks sufficient data to satisfy the
request, the process blocks, supplying the address of the receive queue as a "wait channel" to be
used in notification. When data arrives for the process and is placed in the socket's queue, the
blocked process is identified by the fact it is waiting "on the queue".

6.1.1.  Socket  state 

A socket's state is defined from the following:

	#define	SS_NOFDREF		0x001	/* no file table ref any more */
	#define	SS_ISCONNECTED		0x002	/* socket connected to a peer */
	#define	SS_ISCONNECTING		0x004	/* in process of connecting to peer */
	#define	SS_ISDISCONNECTING	0x008	/* in process of disconnecting */
	#define	SS_CANTSENDMORE		0x010	/* can't send more data to peer */
	#define	SS_CANTRCVMORE		0x020	/* can't receive more data from peer */
	#define	SS_CONNAWAITING		0x040	/* connections awaiting acceptance */
	#define	SS_RCVATMARK		0x080	/* at mark on input */
	#define	SS_PRIV			0x100	/* privileged */
	#define	SS_NBIO			0x200	/* non-blocking ops */
	#define	SS_ASYNC		0x400	/* async i/o notify */

The state of a socket is manipulated both by the protocols and the user (through system
calls). When a socket is created the state is defined based on the type of input/output the user
wishes to perform. "Non-blocking" I/O implies a process should never be blocked to await
resources. Instead, any call which would block returns prematurely with the error
EWOULDBLOCK (the service request may be partially fulfilled, e.g. a request for more data than is
present).

If a process requested "asynchronous" notification of events related to the socket, the SIGIO
signal is posted to the process when such an event occurs. An event is a change in the socket's
state; examples of such occurrences are: space becoming available in the send queue, new data
available in the receive queue, connection establishment or disestablishment, etc.

A socket may be marked "privileged" if it was created by the super-user. Only privileged
sockets may send broadcast packets, or bind addresses in privileged portions of an address space.


6.1.2.  Socket  data  queues 

A socket's data queue contains a pointer to the data stored in the queue and other entries
related to the management of the data. The following structure defines a data queue:

	struct sockbuf {
		short	sb_cc;		/* actual chars in buffer */
		short	sb_hiwat;	/* max actual char count */
		short	sb_mbcnt;	/* chars of mbufs used */
		short	sb_mbmax;	/* max chars of mbufs to use */
		short	sb_lowat;	/* low water mark */
		short	sb_timeo;	/* timeout */
		struct	mbuf *sb_mb;	/* the mbuf chain */
		struct	proc *sb_sel;	/* process selecting read/write */
		short	sb_flags;	/* flags, see below */
	};

Data is stored in a queue as a chain of mbufs. The actual count of characters as well as
high and low water marks are used by the protocols in controlling the flow of data. The socket
routines cooperate in implementing the flow control policy by blocking a process when it requests
to send data and the high water mark has been reached, or when it requests to receive data and
less than the low water mark is present (assuming non-blocking I/O has not been specified).
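
The blocking rule in this paragraph can be written down as two small predicates. This is a sketch of the decision only, following the wording above literally; the sk_ names, the sk_action type, and the reduced sockbuf are hypothetical, and real code also has to deal with partial transfers and with the sleeping itself.

```c
#include <assert.h>

#define SK_NBIO 0x200			/* same value as SS_NBIO */

struct sk_sockbuf {			/* only the fields the decision needs */
	short	sb_cc;			/* actual chars in buffer */
	short	sb_hiwat;		/* high water mark */
	short	sb_lowat;		/* low water mark */
};

enum sk_action { SK_PROCEED, SK_BLOCK, SK_WOULDBLOCK };

/* What happens to a send request on this queue? */
enum sk_action
sk_send_check(const struct sk_sockbuf *sb, int so_state)
{
	if (sb->sb_cc < sb->sb_hiwat)
		return SK_PROCEED;	/* still room below the high water mark */
	return (so_state & SK_NBIO) ? SK_WOULDBLOCK : SK_BLOCK;
}

/* What happens to a receive request on this queue? */
enum sk_action
sk_recv_check(const struct sk_sockbuf *sb, int so_state)
{
	if (sb->sb_cc >= sb->sb_lowat)
		return SK_PROCEED;	/* at least the low water mark is present */
	return (so_state & SK_NBIO) ? SK_WOULDBLOCK : SK_BLOCK;
}
```

SK_WOULDBLOCK corresponds to the EWOULDBLOCK return that non-blocking sockets give in place of sleeping.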

When  a  socket  is  created,  the  supporting  protocol  “reserves”  space  for  the  send  and  receive 
queues  of  the  socket.  The  actual  storage  associated  with  a  socket  queue  may  fluctuate  during  a 
socket's lifetime, but it is assumed this reservation will always allow a protocol to acquire enough
memory  to  satisfy  the  high  water  marks. 

The timeout and select values are manipulated by the socket routines in implementing vari¬
ous portions of the interprocess communications facilities and will not be described here.

A socket queue has a number of flags used in synchronising access to the data and in
acquiring resources:

	#define	SB_LOCK	0x01	/* lock on data queue (so_rcv only) */
	#define	SB_WANT	0x02	/* someone is waiting to lock */
	#define	SB_WAIT	0x04	/* someone is waiting for data/space */
	#define	SB_SEL	0x08	/* buffer is selected */
	#define	SB_COLL	0x10	/* collision selecting */

The  last  two  flags  are  manipulated  by  the  system  in  implementing  the  select  mechanism. 


6.1.3. Socket connection queueing

In dealing with connection oriented sockets (e.g. SOCK_STREAM) the two sides are con¬
sidered distinct. One side is termed active, and generates connection requests. The other side is
called passive and accepts connection requests.

From the passive side, a socket is created with the option SO_ACCEPTCONN specified,
creating two queues of sockets: so_q0 for connections in progress and so_q for connections already
made and awaiting user acceptance. As a protocol is preparing incoming connections, it creates a
socket structure queued on so_q0 by calling the routine sonewconn(). When the connection is
established, the socket structure is then transferred to so_q, making it available for an accept.

If an SO_ACCEPTCONN socket is closed with sockets on either so_q0 or so_q, these sock¬
ets are dropped.
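
The movement of a socket between the two queues can be sketched with singly linked lists threaded through the so_q0 and so_q fields. The sk_ routines below are hypothetical stand-ins, not sonewconn() itself, and the admission rule (refuse once both queues together reach so_qlimit) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct sk_socket {
	struct sk_socket *so_head;	/* listening socket we belong to */
	struct sk_socket *so_q0;	/* link: connections in progress */
	struct sk_socket *so_q;		/* link: connections awaiting accept */
	short	so_q0len, so_qlen, so_qlimit;
};

/* Protocol saw a connection request: queue a new socket on so_q0. */
int
sk_newconn(struct sk_socket *head, struct sk_socket *so)
{
	struct sk_socket *p;

	/* Assumed policy: refuse once both queues together hit so_qlimit. */
	if (head->so_q0len + head->so_qlen >= head->so_qlimit)
		return -1;
	for (p = head; p->so_q0 != NULL; p = p->so_q0)
		continue;
	p->so_q0 = so;
	so->so_q0 = so->so_q = NULL;
	so->so_head = head;
	head->so_q0len++;
	return 0;
}

/* Connection established: move the socket from so_q0 to the tail of so_q. */
int
sk_isconnected(struct sk_socket *so)
{
	struct sk_socket *head = so->so_head, *p;

	if (head == NULL)
		return -1;
	for (p = head; p->so_q0 != so; p = p->so_q0)
		if (p->so_q0 == NULL)
			return -1;	/* not on the partial queue */
	p->so_q0 = so->so_q0;
	so->so_q0 = NULL;
	head->so_q0len--;
	for (p = head; p->so_q != NULL; p = p->so_q)
		continue;
	p->so_q = so;
	head->so_qlen++;
	return 0;
}
```

An accept would then take the socket at the front of so_q, which is the oldest completed connection because both routines append at the tail.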


6.2. Protocol layer(s)

Protocols  are  described  by  a  set  of  entry  points  and  certain  socket  visible  characteristics, 
some  of  which  are  used  in  deciding  which  socket  type(s)  they  may  support. 

An entry in the "protocol switch" table exists for each protocol module configured into the
system. It has the following form:

	struct protosw {
		short	pr_type;		/* socket type used for */
		short	pr_family;		/* protocol family */
		short	pr_protocol;		/* protocol number */
		short	pr_flags;		/* socket visible attributes */
	/* protocol-protocol hooks */
		int	(*pr_input)();		/* input to protocol (from below) */
		int	(*pr_output)();		/* output to protocol (from above) */
		int	(*pr_ctlinput)();	/* control input (from below) */
		int	(*pr_ctloutput)();	/* control output (from above) */
	/* user-protocol hook */
		int	(*pr_usrreq)();		/* user request */
	/* utility hooks */
		int	(*pr_init)();		/* initialisation routine */
		int	(*pr_fasttimo)();	/* fast timeout (200ms) */
		int	(*pr_slowtimo)();	/* slow timeout (500ms) */
		int	(*pr_drain)();		/* flush any excess space possible */
	};

A protocol is called through the pr_init entry before any other. Thereafter it is called every
200 milliseconds through the pr_fasttimo entry and every 500 milliseconds through the
pr_slowtimo entry for timer based actions. The system will call the pr_drain entry if it is low on
space and this should throw away any non-critical data.

Protocols pass data between themselves as chains of mbufs using the pr_input and pr_output
routines. Pr_input passes data up (towards the user) and pr_output passes it down (towards the
network); control information passes up and down on pr_ctlinput and pr_ctloutput. The protocol
is responsible for the space occupied by any of the arguments to these entries and must dispose of it.

The pr_usrreq routine interfaces protocols to the socket code and is described below.

The pr_flags field is constructed from the following values:

	#define	PR_ATOMIC	0x01	/* exchange atomic messages only */
	#define	PR_ADDR		0x02	/* addresses given with messages */
	#define	PR_CONNREQUIRED	0x04	/* connection required by protocol */
	#define	PR_WANTRCVD	0x08	/* want PRU_RCVD calls */
	#define	PR_RIGHTS	0x10	/* passes capabilities */

Protocols which are connection-based specify the PR_CONNREQUIRED flag so that the socket
routines will never attempt to send data before a connection has been established. If the
PR_WANTRCVD flag is set, the socket routines will notify the protocol when the user has
removed data from the socket's receive queue. This allows the protocol to implement
acknowledgement on user receipt, and also update windowing information based on the amount of
space available in the receive queue. The PR_ADDR flag indicates any data placed in the
socket's receive queue will be preceded by the address of the sender. The PR_ATOMIC flag
specifies each request to send data must be performed in a single protocol send request; it is
the protocol's responsibility to maintain record boundaries on data to be sent. The PR_RIGHTS
flag indicates the protocol supports the passing of capabilities; this is currently used only by the
protocols in the UNIX protocol family.

When a socket is created, the socket routines scan the protocol table looking for an
appropriate protocol to support the type of socket being created. The pr_type field contains one of
the possible socket types (e.g. SOCK_STREAM), while the pr_family field indicates which
protocol family the protocol belongs to. The pr_protocol field contains the protocol number of the
protocol, normally a well known value.
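
The table scan and one use of the pr_flags bits can be sketched as follows. The reduced sk_protosw structure keeps only the four short fields, the flag values mirror those listed earlier in this section, and the function names are hypothetical. Matching on family and type alone is an assumption about what "appropriate" means here.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PR_ATOMIC		0x01
#define SK_PR_ADDR		0x02
#define SK_PR_CONNREQUIRED	0x04

struct sk_protosw {			/* entry points omitted */
	short	pr_type;		/* socket type used for */
	short	pr_family;		/* protocol family */
	short	pr_protocol;		/* protocol number */
	short	pr_flags;		/* socket visible attributes */
};

/* Return the first table entry supporting (family, type), or NULL. */
const struct sk_protosw *
sk_findproto(const struct sk_protosw *tab, size_t n, short family, short type)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].pr_family == family && tab[i].pr_type == type)
			return &tab[i];
	return NULL;
}

/* May data be sent now? Connection-based protocols need a connection. */
int
sk_can_send(const struct sk_protosw *pr, int connected)
{
	assert(pr != NULL);
	return !(pr->pr_flags & SK_PR_CONNREQUIRED) || connected;
}
```

The pointer returned by the scan plays the role of the so_proto field in the socket structure: it is looked up once at creation time and reused for every later request.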

6.3. Network-interface layer

Each network-interface configured into a system defines a path through which packets may
be sent and received. Normally a hardware device is associated with this interface, though there
is no requirement for this (for example, all systems have a software "loopback" interface used for
debugging and performance analysis). In addition to manipulating the hardware device, an
interface module is responsible for encapsulation and deencapsulation of any low level header
information required to deliver a message to its destination. The selection of which interface to use in
delivering packets is a routing decision carried out at a higher level than the network-interface
layer. Each interface normally identifies itself at boot time to the routing module so that it may
be selected for packet delivery.

An interface is defined by the following structure,

	struct ifnet {
		char	*if_name;		/* name, e.g. "en" or "lo" */
		short	if_unit;		/* sub-unit for lower level driver */
		short	if_mtu;			/* maximum transmission unit */
		int	if_net;			/* network number of interface */
		short	if_flags;		/* up/down, broadcast, etc. */
		short	if_timer;		/* time 'til if_watchdog called */
		int	if_host[2];		/* local net host number */
		struct	sockaddr if_addr;	/* address of interface */
		union {
			struct	sockaddr ifu_broadaddr;
			struct	sockaddr ifu_dstaddr;
		} if_ifu;
		struct	ifqueue if_snd;		/* output queue */
		int	(*if_init)();		/* init routine */
		int	(*if_output)();		/* output routine */
		int	(*if_ioctl)();		/* ioctl routine */
		int	(*if_reset)();		/* bus reset routine */
		int	(*if_watchdog)();	/* timer routine */
		int	if_ipackets;		/* packets received on interface */
		int	if_ierrors;		/* input errors on interface */
		int	if_opackets;		/* packets sent on interface */
		int	if_oerrors;		/* output errors on interface */
		int	if_collisions;		/* collisions on csma interfaces */
		struct	ifnet *if_next;
	};

Each interface has a send queue and routines used for initialisation, if_init, and output,
if_output. If the interface resides on a system bus, the routine if_reset will be called after a bus
reset has been performed. An interface may also specify a timer routine, if_watchdog, which
should be called every if_timer seconds (if non zero).

The state of an interface and certain characteristics are stored in the if_flags field. The fol¬
lowing values are possible.

	#define	IFF_UP		0x1	/* interface is up */
	#define	IFF_BROADCAST	0x2	/* broadcast address valid */
	#define	IFF_DEBUG	0x4	/* turn on debugging */
	#define	IFF_ROUTE	0x8	/* routing entry installed */
	#define	IFF_POINTOPOINT	0x10	/* interface is point-to-point link */
	#define	IFF_NOTRAILERS	0x20	/* avoid use of trailers */
	#define	IFF_RUNNING	0x40	/* resources allocated */
	#define	IFF_NOARP	0x80	/* no address resolution protocol */

If the interface is connected to a network which supports transmission of broadcast packets, the
IFF_BROADCAST flag will be set and the if_broadaddr field will contain the address to be used
in sending or accepting a broadcast packet. If the interface is associated with a point to point
hardware link (for example, a DEC DMR-11), the IFF_POINTOPOINT flag will be set and
if_dstaddr will contain the address of the host on the other side of the connection. These
addresses and the local address of the interface, if_addr, are used in filtering incoming packets.
The interface sets IFF_RUNNING after it has allocated system resources and posted an initial
read on the device it manages. This state bit is used to avoid multiple allocation requests when
an interface's address is changed. The IFF_NOTRAILERS flag indicates the interface should
refrain from using a trailer encapsulation on outgoing packets; trailer protocols are described in
section 14. The IFF_NOARP flag indicates the interface should not use an "address resolution
protocol" in mapping internetwork addresses to local network addresses.

The information stored in an ifnet structure for point to point communication devices is not
currently used by the system internally. Rather, it is used by the user level routing process in
determining host network connections and in initially devising routes (refer to chapter 10 for more
information).
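
How the flag bits and addresses described above might combine when filtering an incoming packet is sketched below. Addresses are reduced to plain numbers, the broadcast address is a separate field instead of a union member, and the function name sk_if_accept is hypothetical. The sketch is an interpretation of the filtering rule, not code taken from an interface driver.

```c
#include <assert.h>

#define SK_IFF_UP		0x1	/* same values as the IFF_ flags above */
#define SK_IFF_BROADCAST	0x2

struct sk_ifnet {
	short		if_flags;	/* up/down, broadcast, etc. */
	unsigned long	if_addr;	/* local address, simplified to a number */
	unsigned long	if_broadaddr;	/* meaningful only with IFF_BROADCAST */
};

/* Decide whether an incoming packet addressed to dst is for this host. */
int
sk_if_accept(const struct sk_ifnet *ifp, unsigned long dst)
{
	if ((ifp->if_flags & SK_IFF_UP) == 0)
		return 0;		/* interface is down: accept nothing */
	if (dst == ifp->if_addr)
		return 1;		/* addressed to this interface */
	if ((ifp->if_flags & SK_IFF_BROADCAST) && dst == ifp->if_broadaddr)
		return 1;		/* broadcast on a network that supports it */
	return 0;
}
```

Testing the flag before reading if_broadaddr matters because the field holds no valid address when the network does not support broadcast.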

Various  statistics  are  also  stored  in  the  interface  structure.  These  may  be  viewed  by  users 
astog  the  *«tsi«f(l)  program.  *  ' 

The interface address and flags may be set with the SIOCSIFADDR and SIOCSIFFLAGS
ioctls. SIOCSIFADDR is used to initially define each interface's address; SIOCSIFFLAGS can be
used to mark an interface down and perform site-specific configuration.
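
The flag tests described in this section can be sketched in C. This is only an illustration, not the kernel source: struct toy_ifnet and the three helper functions are hypothetical names, and the values of IFF_UP, IFF_BROADCAST and IFF_POINTOPOINT are assumed to match the 4.2BSD header (the other three values are the ones listed above).

```c
/* Flag values; NOTRAILERS, RUNNING and NOARP are listed in the text,
 * the other three are assumed to match the 4.2BSD <net/if.h>. */
#define IFF_UP          0x1
#define IFF_BROADCAST   0x2
#define IFF_POINTOPOINT 0x10
#define IFF_NOTRAILERS  0x20
#define IFF_RUNNING     0x40
#define IFF_NOARP       0x80

/* Hypothetical stand-in for the flags word of struct ifnet. */
struct toy_ifnet {
    short if_flags;
};

/* Resources are allocated only once: skip if IFF_RUNNING is already set. */
int needs_init(const struct toy_ifnet *ifp)
{
    return (ifp->if_flags & IFF_RUNNING) == 0;
}

/* Trailer encapsulation is permitted unless IFF_NOTRAILERS is set. */
int may_use_trailers(const struct toy_ifnet *ifp)
{
    return (ifp->if_flags & IFF_NOTRAILERS) == 0;
}

/* Marking an interface down clears IFF_UP and leaves the other bits alone. */
void mark_down(struct toy_ifnet *ifp)
{
    ifp->if_flags &= ~IFF_UP;
}
```

A driver following this pattern would call needs_init before allocating resources when an address is changed, which is the multiple-allocation guard the text attributes to IFF_RUNNING.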


6.1.  UNIBUS interfaces

All hardware related interfaces currently reside on the UNIBUS. Consequently a common
set of utility routines for dealing with the UNIBUS has been developed. Each UNIBUS interface
utilizes a structure of the following form:


struct ifuba {
        short   ifu_uban;                     /* uba number */
        short   ifu_hlen;                     /* local net header length */
        struct  uba_regs *ifu_uba;            /* uba regs, in vm */
        struct ifrw {
                caddr_t ifrw_addr;            /* virt addr of header */
                int     ifrw_bdp;             /* unibus bdp */
                int     ifrw_info;            /* value from ubaalloc */
                int     ifrw_proto;           /* map register prototype */
                struct  pte *ifrw_mr;         /* base of map registers */
        } ifu_r, ifu_w;
        struct  pte ifu_wmap[IF_MAXNUBAMR];   /* base pages for output */
        short   ifu_xswapd;                   /* mask of clusters swapped */
        short   ifu_flags;                    /* used during uballoc's */
        struct  mbuf *ifu_xtofree;            /* pages being dma'd out */
};


The if_uba structure describes UNIBUS resources held by an interface. IF_NUBAMR map
registers are held for datagram data, starting at ifr_mr. UNIBUS map register ifr_mr[-1] maps
the local network header ending on a page boundary. UNIBUS data paths are reserved for read
and for write, given by ifr_bdp. The prototype of the map registers for read and for write is
saved in ifr_proto.

When write transfers are not full pages on page boundaries the data is just copied into the
pages mapped on the UNIBUS and the transfer is started. If a write transfer is of a (1024 byte)
page size and on a page boundary, UNIBUS page table entries are swapped to reference the pages,
and then the initial pages are remapped from ifu_wmap when the transfer completes.

When read transfers give whole pages of data to be input, page frames are allocated from a
network page list and traded with the pages already containing the data, mapping the allocated
pages to replace the input pages for the next UNIBUS data input.
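
The copy-versus-swap decision for write transfers can be sketched as a small predicate. This is a simplified illustration under stated assumptions: choose_write_method, the enum and TOY_PAGE_SIZE are hypothetical names, and the real routines operate on mbufs and UNIBUS map registers rather than on a bare address and length.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 1024u  /* the (1024 byte) page size mentioned in the text */

enum xfer_method { XFER_COPY, XFER_SWAP };

/* Decide how one data segment of an outgoing packet would be handled:
 * swapped into the UNIBUS map when it is exactly one page on a page
 * boundary, copied into the pre-mapped pages otherwise. */
enum xfer_method choose_write_method(uintptr_t addr, size_t len)
{
    if (len == TOY_PAGE_SIZE && (addr % TOY_PAGE_SIZE) == 0)
        return XFER_SWAP;
    return XFER_COPY;
}
```

Only the swap case avoids touching the data; everything else falls back to the copy path, which is why the output pages are held in reserve.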

The following utility routines are available for use in writing network interface drivers; all
use the ifuba structure described above.

if_ubainit(ifu, uban, hlen, nmr);

if_ubainit allocates resources on UNIBUS adaptor uban and stores the resultant information
in the ifuba structure pointed to by ifu. It is called only at boot time or after a UNIBUS
reset. Two data paths (buffered or unbuffered, depending on the ifu_flags field) are allocated,
one for reading and one for writing. The nmr parameter indicates the number of UNIBUS
mapping registers required to map a maximal sized packet onto the UNIBUS, while hlen
specifies the size of a local network header, if any, which should be mapped separately from
the data (see the description of trailer protocols in chapter 14). Sufficient UNIBUS mapping
registers and pages of memory are allocated to initialize the input data path for an initial
read. For the output data path, mapping registers and pages of memory are also allocated
and mapped onto the UNIBUS. The pages associated with the output data path are held in
reserve in the event a write requires copying non-page-aligned data (see below).

If if_ubainit is called with resources already allocated, they will be used instead of allocating
new ones (this normally occurs after a UNIBUS reset). A 1 is returned when allocation and
initialization is successful, 0 otherwise.

m = if_rubaget(ifu, totlen, off0);

if_rubaget pulls read data off an interface. totlen specifies the length of data to be obtained,
not counting the local network header. If off0 is non-zero, it indicates a byte offset to a
trailing local network header which should be copied into a separate mbuf and prepended to
the front of the resultant mbuf chain. When page sized units of data are present and are
page-aligned, the previously mapped data pages are remapped into the mbufs and swapped
with fresh pages, thus avoiding any copying. A 0 return value indicates a failure to allocate
resources.

if_wubaput(ifu, m);

if_wubaput maps a chain of mbufs onto a network interface in preparation for output. The
chain includes any local network header, which is copied so that it resides in the mapped
and aligned I/O space. Any other mbufs which contained non page sized data portions are
also copied to the I/O space. Pages mapped from a previous output operation (no longer
needed) are unmapped and returned to the network page pool.


7.  Socket/protocol  interface 

The interface between the socket routines and the communication protocols is through the
pr_usrreq routine defined in the protocol switch table. The following requests to a protocol
module are possible:


#define PRU_ATTACH       0    /* attach protocol */
#define PRU_DETACH       1    /* detach protocol */
#define PRU_BIND         2    /* bind socket to address */
#define PRU_LISTEN       3    /* listen for connection */
#define PRU_CONNECT      4    /* establish connection to peer */
#define PRU_ACCEPT       5    /* accept connection from peer */
#define PRU_DISCONNECT   6    /* disconnect from peer */
#define PRU_SHUTDOWN     7    /* won't send any more data */
#define PRU_RCVD         8    /* have taken data; more room now */
#define PRU_SEND         9    /* send this data */
#define PRU_ABORT        10   /* abort (fast DISCONNECT, DETACH) */
#define PRU_CONTROL      11   /* control operations on protocol */
#define PRU_SENSE        12   /* return status into m */
#define PRU_RCVOOB       13   /* retrieve out of band data */
#define PRU_SENDOOB      14   /* send out of band data */
#define PRU_SOCKADDR     15   /* fetch socket's address */
#define PRU_PEERADDR     16   /* fetch peer's address */
#define PRU_CONNECT2     17   /* connect two sockets */
/* begin for protocols internal use */
#define PRU_FASTTIMO     18   /* 200ms timeout */
#define PRU_SLOWTIMO     19   /* 500ms timeout */
#define PRU_PROTORCV     20   /* receive from below */
#define PRU_PROTOSEND    21   /* send to below */


A call on the user request routine is of the form,

     error = (*protosw[].pr_usrreq)(up, req, m, addr, rights);
     int error; struct socket *up; int req; struct mbuf *m, *rights; caddr_t addr;

The mbuf chain, m, and the address are optional parameters. The rights parameter is an optional
pointer to an mbuf chain containing user specified capabilities (see the sendmsg and recvmsg
system calls). The protocol is responsible for disposal of both mbuf chains. A non-zero return value
gives a UNIX error number which should be passed to higher level software. The following
paragraphs describe each of the requests possible.

PRU_ATTACH

When a protocol is bound to a socket (with the socket system call) the protocol module is
called with this request. It is the responsibility of the protocol module to allocate any
resources necessary. The "attach" request will always precede any of the other requests,
and should not occur more than once.

PRU_DETACH

This is the antithesis of the attach request, and is used at the time a socket is deleted. The
protocol module may deallocate any resources assigned to the socket.

PRU_BIND

When a socket is initially created it has no address bound to it. This request indicates an
address should be bound to an existing socket. The protocol module must verify the
requested address is valid and available for use.

PRU_LISTEN

The "listen" request indicates the user wishes to listen for incoming connection requests on
the associated socket. The protocol module should perform any state changes needed to
carry out this request (if possible). A "listen" request always precedes a request to accept a
connection.

PRU_CONNECT

The "connect" request indicates the user wants to establish an association. The addr
parameter supplied describes the peer to be connected to. The effect of a connect request
may vary depending on the protocol. Virtual circuit protocols, such as TCP [Postel80b], use
this request to initiate establishment of a TCP connection. Datagram protocols, such as
UDP [Postel79], simply record the peer's address in a private data structure and use it to
tag all outgoing packets. There are no restrictions on how many times a connect request
may be used after an attach. If a protocol supports the notion of multi-casting, it is possible
to use multiple connects to establish a multi-cast group. Alternatively, an association may
be broken by a PRU_DISCONNECT request, and a new association created with a
subsequent connect request, all without destroying and creating a new socket.

PRU_ACCEPT

Following a successful PRU_LISTEN request and the arrival of one or more connections,
this request is made to indicate the user has accepted the first connection on the queue of
pending connections. The protocol module should fill in the supplied address buffer with the
address of the connected party.

PRU_DISCONNECT

Eliminate an association created with a PRU_CONNECT request.

PRU_SHUTDOWN

This call is used to indicate no more data will be sent and/or received (the addr parameter
indicates the direction of the shutdown, as encoded in the soshutdown system call). The
protocol may, at its discretion, deallocate any data structures related to the shutdown.

PRU_RCVD

This request is made only if the protocol entry in the protocol switch table includes the
PR_WANTRCVD flag. When a user removes data from the receive queue this request will
be sent to the protocol module. It may be used to trigger acknowledgements, refresh
windowing information, initiate data transfer, etc.


—  December  1985  — 


Final Report Appendix C - Socket/protocol Interface

PRU_SEND

Each user request to send data is translated into one or more PRU_SEND requests (a
protocol may indicate a single user send request must be translated into a single PRU_SEND
request by specifying the PR_ATOMIC flag in its protocol description). The data to be sent
is presented to the protocol as a list of mbufs and an address is, optionally, supplied in the
addr parameter. The protocol is responsible for preserving the data in the socket's send
queue if it is not able to send it immediately, or if it may need it at some later time (e.g. for
retransmission).

PRU_ABORT

This request indicates an abnormal termination of service. The protocol should delete any
existing association(s).

PRU_CONTROL

The "control" request is generated when a user performs a UNIX ioctl system call on a
socket (and the ioctl is not intercepted by the socket routines). It allows protocol-specific
operations to be provided outside the scope of the common socket interface. The addr
parameter contains a pointer to a static kernel data area where relevant information may be
obtained or returned. The m parameter contains the actual ioctl request code (note the
non-standard calling convention).

PRU_SENSE

The "sense" request is generated when the user makes an fstat system call on a socket; it
requests status of the associated socket. There currently is no common format for the status
returned. Information which might be returned includes per-connection statistics, protocol
state, resources currently in use by the connection, and the optimal transfer size for the
connection (based on windowing information and maximum packet size). The addr parameter
contains a pointer to a static kernel data area where the status buffer should be placed.

PRU_RCVOOB

Any "out-of-band" data presently available is to be returned. An mbuf is passed in to the
protocol module and the protocol should either place data in the mbuf or attach new mbufs
to the one supplied if there is insufficient space in the single mbuf.

PRU_SENDOOB

Like PRU_SEND, but for out-of-band data.

PRU_SOCKADDR

The local address of the socket is returned, if any is currently bound to it. The address
format (protocol specific) is returned in the addr parameter.

PRU_PEERADDR

The address of the peer to which the socket is connected is returned. The socket must be in
a SS_ISCONNECTED state for this request to be made to the protocol. The address
format (protocol specific) is returned in the addr parameter.

PRU_CONNECT2

The protocol module is supplied two sockets and requested to establish a connection
between the two without binding any addresses, if possible. This call is used in
implementing the socketpair(2) system call.

The following requests are used internally by the protocol modules and are never generated
by the socket routines. In certain instances, they are handed to the pr_usrreq routine solely for
convenience in tracing a protocol's operation (e.g. PRU_SLOWTIMO).

PRU_FASTTIMO

A "fast timeout" has occurred. This request is made when a timeout occurs in the protocol's
pr_fasttimo routine. The addr parameter indicates which timer expired.

PRU_SLOWTIMO

A "slow timeout" has occurred. This request is made when a timeout occurs in the
protocol's pr_slowtimo routine. The addr parameter indicates which timer expired.

PRU_PROTORCV

This request is used in the protocol-protocol interface, not by the routines. It requests
reception of data destined for the protocol and not the user. No protocols currently use this
facility.

PRU_PROTOSEND 

This request allows a protocol to send data destined for another protocol module, not a user.
The details of how data is marked "addressed to protocol" instead of "addressed to user"
are left to the protocol modules. No protocols currently use this facility.
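
The way a protocol module dispatches on the request code can be sketched with a toy user request routine. Everything prefixed toy_ is hypothetical, the optional parameters are reduced to untyped pointers, and the particular error numbers returned are assumptions chosen for the illustration; only the request codes and the rule that "attach" comes first and occurs once are taken from the text.

```c
#include <errno.h>
#include <stddef.h>

#define PRU_ATTACH 0
#define PRU_DETACH 1
#define PRU_BIND   2

/* Hypothetical, much simplified socket: only what the sketch needs. */
struct toy_socket {
    void *so_pcb;      /* per-socket protocol state, NULL until attach */
    int   so_bound;    /* non-zero once an address is bound */
};

static int toy_pcb;    /* stands in for dynamically allocated state */

/* Toy user request routine; m, addr and rights are accepted but unused. */
int toy_usrreq(struct toy_socket *so, int req, void *m, void *addr, void *rights)
{
    (void)m; (void)addr; (void)rights;

    /* "attach" must precede every other request and occur only once. */
    if (req == PRU_ATTACH) {
        if (so->so_pcb != NULL)
            return EISCONN;
        so->so_pcb = &toy_pcb;
        return 0;
    }
    if (so->so_pcb == NULL)
        return EINVAL;

    switch (req) {
    case PRU_BIND:
        so->so_bound = 1;
        return 0;
    case PRU_DETACH:
        so->so_pcb = NULL;
        so->so_bound = 0;
        return 0;
    default:
        return EOPNOTSUPP;   /* request not handled by this toy protocol */
    }
}

/* One-entry protocol switch in the spirit of protosw[]. */
struct toy_protosw {
    int (*pr_usrreq)(struct toy_socket *, int, void *, void *, void *);
};
struct toy_protosw toy_sw[] = { { toy_usrreq } };
```

A caller would invoke it as (*toy_sw[0].pr_usrreq)(&so, PRU_ATTACH, NULL, NULL, NULL) and pass any non-zero result up as a UNIX error number, mirroring the calling form given at the start of this section.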


8.  Protocol/protocol interface

The interface between protocol modules is through the pr_usrreq, pr_input, pr_output,
pr_ctlinput, and pr_ctloutput routines. The calling conventions for all but the pr_usrreq routine
are expected to be specific to the protocol modules and are not guaranteed to be consistent across
protocol families. We will examine the conventions used for some of the Internet protocols in this
section as an example.

8.1.  pr_output

The Internet protocol UDP uses the convention,

     error = udp_output(inp, m);
     int error; struct inpcb *inp; struct mbuf *m;

where the inp, "internet protocol control block", passed between modules conveys per connection
state information, and the mbuf chain contains the data to be sent. UDP performs consistency
checks, appends its header, calculates a checksum, etc. before passing the packet on to the IP
module:

     error = ip_output(m, opt, ro, allowbroadcast);
     int error; struct mbuf *m, *opt; struct route *ro; int allowbroadcast;

The call to IP's output routine is more complicated than that for UDP, as befits the
additional work the IP module must do. The m parameter is the data to be sent, and the opt
parameter is an optional list of IP options which should be placed in the IP packet header. The ro
parameter is used in making routing decisions (and passing them back to the caller). The final
parameter, allowbroadcast, is a flag indicating if the user is allowed to transmit a broadcast packet.
This may be inconsequential if the underlying hardware does not support the notion of
broadcasting.

All output routines return 0 on success and a UNIX error number if a failure occurred which
could be immediately detected (no buffer space available, no route to destination, etc.).
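
The convention that each output routine returns 0 or an immediately detectable UNIX error number, and that the upper layer passes the lower layer's result upward, can be sketched as follows. The toy_ types and functions are hypothetical, the packet is reduced to a length and a broadcast marker, and the choice of EHOSTUNREACH and EACCES is an assumption for illustration rather than what the system's ip_output necessarily returns.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for mbuf chains and routes. */
struct toy_packet { size_t len; int is_broadcast; };
struct toy_route  { int have_route; };

/* Lower layer: returns 0 or a UNIX error number detected immediately. */
int toy_ip_output(struct toy_packet *m, struct toy_route *ro, int allowbroadcast)
{
    if (ro == NULL || !ro->have_route)
        return EHOSTUNREACH;           /* no route to destination */
    if (m->is_broadcast && !allowbroadcast)
        return EACCES;                 /* caller may not broadcast */
    return 0;
}

/* Upper layer: adds its header, then passes the lower layer's result up. */
int toy_udp_output(struct toy_packet *m, struct toy_route *ro)
{
    m->len += 8;                       /* a UDP header is 8 bytes long */
    return toy_ip_output(m, ro, 0);
}
```

Errors that arise later, such as loss on the wire, are outside this return path, which is why the text restricts it to failures that can be detected at once.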

8.2.  pr_input

Both UDP and TCP use the following calling convention,

     (void) (*protosw[].pr_input)(m);
     struct mbuf *m;

Each  mbuf  list  passed  is  a  single  packet  to  be  processed  by  the  protocol  module. 

The IP input routine is a VAX software interrupt level routine, and so is not called with any
parameters. It instead communicates with network interfaces through a queue, ipintrq, which is
identical in structure to the queues used by the network interfaces for storing packets awaiting
transmission.

8.3.  pr_ctlinput

This routine is used to convey "control" information to a protocol module (i.e. information
which might be passed to the user, but is not data). This routine, and the pr_ctloutput routine,
have not been extensively developed, and thus suffer from a "clumsiness" that can only be
improved as more demands are placed on them.

The common calling convention for this routine is,

     (void) (*protosw[].pr_ctlinput)(req, info);
     int req; caddr_t info;


The req parameter is one of the following,

#define PRC_IFDOWN             0    /* interface transition */
#define PRC_ROUTEDEAD          1    /* select new route if possible */
#define PRC_QUENCH             4    /* some said to slow down */
#define PRC_HOSTDEAD           6    /* normally from IMP */
#define PRC_HOSTUNREACH        7    /* ditto */
#define PRC_UNREACH_NET        8    /* no route to network */
#define PRC_UNREACH_HOST       9    /* no route to host */
#define PRC_UNREACH_PROTOCOL   10   /* dst says bad protocol */
#define PRC_UNREACH_PORT       11   /* bad port # */
#define PRC_MSGSIZE            12   /* message size forced drop */
#define PRC_REDIRECT_NET       13   /* net routing redirect */
#define PRC_REDIRECT_HOST      14   /* host routing redirect */
#define PRC_TIMXCEED_INTRANS   17   /* packet lifetime expired in transit */
#define PRC_TIMXCEED_REASS     18   /* lifetime expired on reass q */
#define PRC_PARAMPROB          19   /* header incorrect */


while the info parameter is a "catchall" value which is request dependent. Many of the requests
have obviously been derived from ICMP (the Internet Control Message Protocol), and from error
messages defined in the 1822 host/IMP convention [BBN78]. Mapping tables exist to convert
control requests to UNIX error codes which are delivered to a user.
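
A mapping table of the kind mentioned above might look like the following sketch. The table name, its size and the specific pairings of control requests with error numbers are assumptions made for illustration; the text does not list the actual table contents. The request values follow the 4.2BSD header.

```c
#include <errno.h>

#define PRC_UNREACH_NET      8
#define PRC_UNREACH_HOST     9
#define PRC_UNREACH_PROTOCOL 10
#define PRC_UNREACH_PORT     11
#define PRC_MSGSIZE          12
#define PRC_NCMDS            20   /* hypothetical table size */

/* Illustrative mapping only; entries left 0 mean "no error reported". */
static const int toy_ctlerrmap[PRC_NCMDS] = {
    [PRC_UNREACH_NET]      = ENETUNREACH,
    [PRC_UNREACH_HOST]     = EHOSTUNREACH,
    [PRC_UNREACH_PROTOCOL] = ENOPROTOOPT,
    [PRC_UNREACH_PORT]     = ECONNREFUSED,
    [PRC_MSGSIZE]          = EMSGSIZE,
};

/* Convert a control request to a UNIX error code, 0 if none applies. */
int toy_ctl_to_errno(int req)
{
    if (req < 0 || req >= PRC_NCMDS)
        return 0;
    return toy_ctlerrmap[req];
}
```

Indexing by request code keeps the conversion a single bounds check and array lookup, which suits a routine called from interrupt-level protocol code.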


8.4.  pr_ctloutput

This  routine  is  not  currently  used  by  any  protocol  modules. 




9.  Protocol/network-interface  Interface 

The lowest layer in the set of protocols which comprise a protocol family must interface
itself to one or more network interfaces in order to transmit and receive packets. It is assumed
that any routing decisions have been made before handing a packet to a network interface; in fact
this is absolutely necessary in order to locate any interface at all (unless, of course, one uses a
single "hardwired" interface). There are two cases to be concerned with, transmission of a packet,
and receipt of a packet; each will be considered separately.

9.1.  Packet transmission

Assuming a protocol has a handle on an interface, ifp, a (struct ifnet *), it transmits a fully
formatted packet with the following call,

     error = (*ifp->if_output)(ifp, m, dst)
     int error; struct ifnet *ifp; struct mbuf *m; struct sockaddr *dst;

The output routine for the network interface transmits the packet m to the dst address, or returns
an error indication (a UNIX error number). In reality transmission may not be immediate, or
successful; normally the output routine simply queues the packet on its send queue and primes an
interrupt driven routine to actually transmit the packet. For unreliable mediums, such as the
Ethernet, "successful" transmission simply means the packet has been placed on the cable
without a collision. On the other hand, an 1822 interface guarantees proper delivery or an error
indication for each message transmitted. The model employed in the networking system attaches
no promises of delivery to the packets handed to a network interface, and thus corresponds more
closely to the Ethernet. Errors returned by the output routine are normally trivial in nature (no
buffer space, address format not handled, etc.).


9.2.  Packet reception

Each protocol family must have one or more "lowest level" protocols. These protocols deal
with internetwork addressing and are responsible for the delivery of incoming packets to the
proper protocol processing modules. In the PUP model [Boggs78] these protocols are termed
Level 1 protocols, in the ISO model, network layer protocols. In our system each such protocol
module has an input packet queue assigned to it. Incoming packets received by a network
interface are queued up for the protocol module and a VAX software interrupt is posted to initiate
processing.

Three  macros  are  available  for  queueing  and  dequeueing  packets, 

IF_ENQUEUE(ifq,  m) 

This  places  the  packet  m  at  the  tail  of  the  queue  ifq. 

IF_DEQUEUE(ifq, m)

This places a pointer to the packet at the head of queue ifq in m. A zero value will be
returned in m if the queue is empty.

IF_PREPEND(ifq, m)

This places the packet m at the head of the queue ifq.

Each queue has a maximum length associated with it as a simple form of congestion control.
The macro IF_QFULL(ifq) returns 1 if the queue is filled, in which case the macro IF_DROP(ifq)
should be used to bump a count of the number of packets dropped and the offending packet
dropped. For example, the following code fragment is commonly found in a network interface's
input routine,

     if (IF_QFULL(inq)) {
             IF_DROP(inq);
             m_freem(m);
     } else
             IF_ENQUEUE(inq, m);
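
To make the behaviour of these macros concrete, here is a self-contained user-level model of them. It is a sketch under assumptions: the structure layouts, the field names (ifq_head, ifq_tail, ifq_len, ifq_maxlen, ifq_drops, m_act) and the macro bodies are written for this illustration and are not copied from the system headers; toy_input is a hypothetical wrapper around the fragment above.

```c
#include <stddef.h>

/* Hypothetical minimal packet and queue types. */
struct toy_mbuf { struct toy_mbuf *m_act; int id; };

struct toy_ifqueue {
    struct toy_mbuf *ifq_head, *ifq_tail;
    int ifq_len, ifq_maxlen, ifq_drops;
};

#define IF_QFULL(q)  ((q)->ifq_len >= (q)->ifq_maxlen)
#define IF_DROP(q)   ((q)->ifq_drops++)

/* Append m at the tail of the queue. */
#define IF_ENQUEUE(q, m) do {                            \
    (m)->m_act = NULL;                                   \
    if ((q)->ifq_tail == NULL) (q)->ifq_head = (m);      \
    else (q)->ifq_tail->m_act = (m);                     \
    (q)->ifq_tail = (m);                                 \
    (q)->ifq_len++;                                      \
} while (0)

/* Insert m at the head of the queue. */
#define IF_PREPEND(q, m) do {                            \
    (m)->m_act = (q)->ifq_head;                          \
    if ((q)->ifq_tail == NULL) (q)->ifq_tail = (m);      \
    (q)->ifq_head = (m);                                 \
    (q)->ifq_len++;                                      \
} while (0)

/* Remove the head packet into m; m becomes NULL if the queue is empty. */
#define IF_DEQUEUE(q, m) do {                            \
    (m) = (q)->ifq_head;                                 \
    if ((m) != NULL) {                                   \
        (q)->ifq_head = (m)->m_act;                      \
        if ((q)->ifq_head == NULL) (q)->ifq_tail = NULL; \
        (m)->m_act = NULL;                               \
        (q)->ifq_len--;                                  \
    }                                                    \
} while (0)

/* The input-routine fragment from the text, returning 1 if queued. */
int toy_input(struct toy_ifqueue *inq, struct toy_mbuf *m)
{
    if (IF_QFULL(inq)) {
        IF_DROP(inq);
        return 0;          /* the kernel would m_freem(m) here */
    }
    IF_ENQUEUE(inq, m);
    return 1;
}
```

With a maximum length of 2, a third call to toy_input is refused and the drop counter is bumped, which is the whole of the congestion control this mechanism provides.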


10.  Gateways and routing issues

The system has been designed with the expectation that it will be used in an internetwork
environment. The "canonical" environment was envisioned to be a collection of local area
networks connected at one or more points through hosts with multiple network interfaces (one on
each local area network), and possibly a connection to a long haul network (for example, the
ARPANET). In such an environment, issues of gatewaying and packet routing become very
important. Certain of these issues, such as congestion control, have been handled in a simplistic
manner or specifically not addressed. Instead, where possible, the network system attempts to
provide simple mechanisms upon which more involved policies may be implemented. As some of
these problems become better understood, the solutions developed will be incorporated into the
system.

This section will describe the facilities provided for packet routing. The simplistic
mechanisms provided for congestion control are described in chapter 12.

10.1.  Routing tables

The network system maintains a set of routing tables for selecting a network interface to
use in delivering a packet to its destination. These tables are of the form:

struct rtentry {
        u_long  rt_hash;               /* hash key for lookups */
        struct  sockaddr rt_dst;       /* destination net or host */
        struct  sockaddr rt_gateway;   /* forwarding agent */
        short   rt_flags;              /* see below */
        short   rt_refcnt;             /* no. of references to structure */
        u_long  rt_use;                /* packets sent using route */
        struct  ifnet *rt_ifp;         /* interface to give packet to */
};

The routing information is organized in two separate tables, one for routes to a host and one
for routes to a network. The distinction between hosts and networks is necessary so that a single
mechanism may be used for both broadcast and multi-drop type networks, and also for networks
built from point-to-point links (e.g. DECnet [DEC80]).

Each table is organized as a hashed set of linked lists. Two 32-bit hash values are
calculated by routines defined for each address family; one based on the destination being a host,
and one assuming the target is the network portion of the address. Each hash value is used to
locate a hash chain to search (by taking the value modulo the hash table size) and the entire 32-bit
value is then used as a key in scanning the list of routes. Lookups are applied first to the routing
table for hosts, then to the routing table for networks. If both lookups fail, a final lookup is made
for a "wildcard" route (by convention, network 0). By doing this, routes to a specific host on a
network may be present as well as routes to the network. This also allows a "fall back" network
route to be defined to a "smart" gateway which may then perform more intelligent routing.
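
The lookup order (host table, then network table, then wildcard) can be sketched as follows. This is a deliberately simplified model: the hash chains are replaced by a linear scan, addresses are plain 32-bit integers whose upper 24 bits are assumed to be the network number, and all toy_ names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical 32-bit addresses: high 24 bits network, low 8 bits host. */
#define TOY_NET(a)   ((a) & 0xffffff00u)

struct toy_rtentry {
    unsigned long rt_dst;     /* destination host or network */
    const char   *rt_name;    /* label used only by this sketch */
};

/* Scan a table for an entry whose destination equals key. */
static const struct toy_rtentry *
scan(const struct toy_rtentry *tab, size_t n, unsigned long key)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].rt_dst == key)
            return &tab[i];
    return NULL;
}

/* Host table first, then network table, then the wildcard (network 0). */
const struct toy_rtentry *
toy_lookup(const struct toy_rtentry *hosts, size_t nh,
           const struct toy_rtentry *nets, size_t nn, unsigned long dst)
{
    const struct toy_rtentry *rt;

    if ((rt = scan(hosts, nh, dst)) != NULL)
        return rt;
    if ((rt = scan(nets, nn, TOY_NET(dst))) != NULL)
        return rt;
    return scan(nets, nn, 0);
}
```

Because the host table is consulted first, a host-specific route overrides the route to its network, and the wildcard entry is reached only when nothing more specific exists.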

Each routing table entry contains a destination (who's at the other end of the route), a
gateway to send the packet to, and various flags which indicate the route's status and type (host or
network). A count of the number of packets sent using the route is kept for use in deciding
between multiple routes to the same destination (see below), and a count of "held references" to
the dynamically allocated structure is maintained to ensure memory reclamation occurs only when
the route is not in use. Finally a pointer to a network interface is kept; packets sent using
the route should be handed to this interface.

Routes are typed in two ways: either as host or network, and as "direct" or "indirect". The
host/network distinction determines how to compare the rt_dst field during lookup. If the route
is to a network, only a packet's destination network is compared to the rt_dst entry stored in the
table. If the route is to a host, the addresses must match bit for bit.

The distinction between "direct" and "indirect" routes indicates whether the destination is
directly connected to the source. This is needed when performing local network encapsulation. If
a packet is destined for a peer at a host or network which is not directly connected to the source,
the internetwork packet header will indicate the address of the eventual destination, while the
local network header will indicate the address of the intervening gateway. Should the destination
be directly connected, these addresses are likely to be identical, or a mapping between the two
exists. The RTF_GATEWAY flag indicates the route is to an "indirect" gateway agent and the
local network header should be filled in from the rt_gateway field instead of rt_dst, or from the
internetwork destination address.
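
The effect of RTF_GATEWAY on local network encapsulation can be sketched as a choice of first-hop address. The structure and function names here are hypothetical, addresses are reduced to integers, and the flag value is assumed to match the 4.2BSD header.

```c
/* Flag value assumed to match the 4.2BSD <net/route.h>. */
#define RTF_GATEWAY 0x2

/* Hypothetical reduced route entry with integer addresses. */
struct toy_route_entry {
    unsigned long rt_dst;
    unsigned long rt_gateway;
    short         rt_flags;
};

/* Address to place in the local network header for a packet to dst. */
unsigned long toy_first_hop(const struct toy_route_entry *rt, unsigned long dst)
{
    if (rt->rt_flags & RTF_GATEWAY)
        return rt->rt_gateway;   /* indirect: hand the packet to the gateway */
    return dst;                  /* direct: destination is on this network */
}
```

In both cases the internetwork header still carries the eventual destination; only the local network header differs.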

It is assumed multiple routes to the same destination will not be present unless they are
deemed equal in cost (the current routing policy process never installs multiple routes to the same
destination). However, should multiple routes to the same destination exist, a request for a route
will return the "least used" route based on the total number of packets sent along this route.
This can result in a "ping-pong" effect (alternate packets taking alternate routes), unless protocols
"hold onto" routes until they no longer find them useful; either because the destination has
changed, or because the route is lossy.

Routing redirect control messages are used to dynamically modify existing routing table
entries as well as dynamically create new routing table entries. On hosts where exhaustive
routing information is too expensive to maintain (e.g. work stations), the combination of wildcard
routing entries and routing redirect messages can be used to provide a simple routing
management scheme without the use of a higher level policy process. Statistics are kept by the routing
table routines on the use of routing redirect messages and their effect on the routing tables.
These statistics may be viewed using netstat(1).

Status information other than routing redirect control messages may be used in the future,
but at present they are ignored. Likewise, more intelligent "metrics" may be used to describe
routes in the future, possibly based on bandwidth and monetary costs.
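
The "least used" selection among equal-cost routes, and the ping-pong effect it can produce, can be illustrated with a short sketch. The names are hypothetical and the table is a flat array rather than the hashed lists the system uses.

```c
#include <stddef.h>

struct toy_rt { unsigned long rt_dst; unsigned long rt_use; };

/* Return the matching route with the smallest use count, or NULL. */
struct toy_rt *toy_least_used(struct toy_rt *tab, size_t n, unsigned long dst)
{
    struct toy_rt *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (tab[i].rt_dst != dst)
            continue;
        if (best == NULL || tab[i].rt_use < best->rt_use)
            best = &tab[i];
    }
    return best;
}
```

If a caller looks up a route for every packet and increments rt_use each time, two equal-cost routes are chosen alternately; a caller that keeps the returned route avoids this.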

10.2.  Routing  table  interface 

A protocol accesses the routing tables through three routines, one to allocate a route, one to
free a route, and one to process a routing redirect control message. The routine rtalloc performs
route allocation; it is called with a pointer to the following structure,

struct route {
        struct  rtentry *ro_rt;
        struct  sockaddr ro_dst;
};

The route returned is assumed "held" by the caller until disposed of with an rtfree call.
Protocols which implement virtual circuits, such as TCP, hold onto routes for the duration of the
circuit's lifetime, while connection-less protocols, such as UDP, currently allocate and free routes
on each transmission.
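
The "held" reference discipline can be sketched with a reference count. This is an assumption-laden model: the in_table and reclaimed fields, and the toy_ function names, are invented for the illustration, and the real rtalloc also performs the table lookup that is omitted here.

```c
#include <stddef.h>

struct toy_rtent  { int rt_refcnt; int in_table; int reclaimed; };
struct toy_routeh { struct toy_rtent *ro_rt; };

/* Take a held reference, as a caller of rtalloc would obtain. */
void toy_rtalloc(struct toy_routeh *ro, struct toy_rtent *found)
{
    ro->ro_rt = found;
    if (found != NULL)
        found->rt_refcnt++;
}

/* Drop the reference; reclaim only when unused and already deleted. */
void toy_rtfree(struct toy_routeh *ro)
{
    struct toy_rtent *rt = ro->ro_rt;

    if (rt == NULL)
        return;
    ro->ro_rt = NULL;
    if (--rt->rt_refcnt == 0 && !rt->in_table)
        rt->reclaimed = 1;      /* the kernel would free the memory here */
}
```

A virtual circuit protocol would call toy_rtalloc once per connection and toy_rtfree at teardown, while a connection-less protocol would bracket each transmission with the pair.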

The routine rtredirect is called to process a routing redirect control message. It is called
with a destination address and the new gateway to that destination. If a non-wildcard route
exists to the destination, the gateway entry in the route is modified to point at the new gateway
supplied. Otherwise, a new routing table entry is inserted reflecting the information supplied.
Routes to interfaces and routes to gateways which are not directly accessible from the host are
ignored.
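The redirect rules above can be illustrated with a toy table. The sketch below is not the kernel
routine: addresses are plain integers, destination 0 stands for the wildcard, and the names
toy_rt, toy_table and toy_rtredirect, the is_iface flag, and the gw_direct argument are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXRT 8

struct toy_rt {
    unsigned dst;        /* destination (0 stands for the wildcard) */
    unsigned gateway;    /* next hop */
    int      is_iface;   /* nonzero: route to a directly attached interface */
};

struct toy_table {
    struct toy_rt rt[TOY_MAXRT];
    int n;               /* entries in use */
};

/*
 * Apply a redirect for dst via gateway.  gw_direct says whether the new
 * gateway is directly reachable from this host.  Returns 1 if the table
 * changed, 0 if the redirect was ignored.
 */
int toy_rtredirect(struct toy_table *t, unsigned dst, unsigned gateway,
                   int gw_direct)
{
    int i;

    assert(t != NULL && dst != 0);
    if (!gw_direct)
        return 0;                   /* unreachable gateway: ignore */
    for (i = 0; i < t->n; i++) {
        struct toy_rt *r = &t->rt[i];
        if (r->dst != dst)
            continue;               /* wildcard entries never match */
        if (r->is_iface)
            return 0;               /* routes to interfaces: ignore */
        r->gateway = gateway;       /* existing route: repoint it */
        return 1;
    }
    if (t->n == TOY_MAXRT)
        return 0;                   /* table full in this toy model */
    t->rt[t->n].dst = dst;          /* no specific route: insert one */
    t->rt[t->n].gateway = gateway;
    t->rt[t->n].is_iface = 0;
    t->n++;
    return 1;
}
```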

10.3.  User  level  routing  policies

Routing policies implemented in user processes manipulate the kernel routing tables through
two ioctl calls. The commands SIOCADDRT and SIOCDELRT add and delete routing entries,
respectively; the tables are read through the /dev/kmem device. The decision to place policy
decisions in a user process implies routing table updates may lag a bit behind the identification
of new routes, or the failure of existing routes, but this period of instability is normally very
small with proper implementation of the routing process. Advisory information, such as ICMP error
messages and IMP diagnostic messages, may be read from raw sockets (described in the next
section).

One routing policy process has already been implemented. The system standard “routing
daemon” uses a variant of the Xerox NS Routing Information Protocol [Xerox82] to maintain up
to date routing tables in our local environment. Interaction with other existing routing
protocols, such as the Internet GGP (Gateway-Gateway Protocol), may be accomplished using a
similar process.


11.  Raw  sockets 

A raw socket is a mechanism which allows users direct access to a lower level protocol. Raw
sockets are intended for knowledgeable processes which wish to take advantage of some protocol
feature not directly accessible through the normal interface, or for the development of new
protocols built atop existing lower level protocols. For example, a new version of TCP might be
developed at the user level by utilizing a raw IP socket for delivery of packets. The raw IP
socket interface attempts to provide an identical interface to the one a protocol would have if it
were resident in the kernel.

The raw socket support is built around a generic raw socket interface, and (possibly)
augmented by protocol-specific processing routines. This section will describe the core of the raw
socket interface.

11.1.  Control  blocks

Every  raw  socket  has  a  protocol  control  block  of  the  following  form, 


struct rawcb {
        struct  rawcb *rcb_next;        /* doubly linked list */
        struct  rawcb *rcb_prev;
        struct  socket *rcb_socket;     /* back pointer to socket */
        struct  sockaddr rcb_faddr;     /* destination address */
        struct  sockaddr rcb_laddr;     /* socket's address */
        caddr_t rcb_pcb;                /* protocol specific stuff */
        short   rcb_flags;
};

All the control blocks are kept on a doubly linked list for performing lookups during packet
dispatch. Associations may be recorded in the control block and used by the output routine in
preparing packets for transmission. The addresses are also used to filter packets on input; this
will be described in more detail shortly. If any protocol specific information is required, it may
be attached to the control block using the rcb_pcb field.

A raw socket interface is datagram oriented. That is, each send or receive on the socket
requires a destination address. This address may be supplied by the user or stored in the control
block and automatically installed in the outgoing packet by the output routine. Since it is not
possible to determine whether an address is present or not in the control block, two flags,
RAW_LADDR and RAW_FADDR, indicate if a local and foreign address are present. Another
flag, RAW_DONTROUTE, indicates if routing should be performed on outgoing packets. If it is,
a route is expected to be allocated for each “new” destination address. That is, the first time a
packet is transmitted a route is determined, and thereafter each time the destination address
stored in rcb_route differs from rcb_faddr, or rcb_route.ro_rt is zero, the old route is discarded
and a new one allocated.
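The test for when a cached route must be replaced can be written as a small predicate. The
following is a sketch under stated assumptions: the structure is a cut-down stand-in for the raw
control block, the cached route is flattened into the route_dst and route_rt fields, and the flag
value and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TOY_RAW_DONTROUTE 0x1   /* assumed bit value */

struct toy_addr { unsigned char bytes[8]; };

struct toy_rawcb {
    struct toy_addr rcb_faddr;  /* current destination */
    struct toy_addr route_dst;  /* destination the cached route is for */
    void           *route_rt;   /* cached route, NULL if none */
    short           rcb_flags;
};

/*
 * Returns 1 when the output path should discard the cached route and
 * allocate a fresh one, 0 when no new route is needed.
 */
int toy_need_new_route(const struct toy_rawcb *rp)
{
    assert(rp != NULL);
    if (rp->rcb_flags & TOY_RAW_DONTROUTE)
        return 0;                               /* routing bypassed */
    if (rp->route_rt == NULL)
        return 1;                               /* no route held yet */
    return memcmp(&rp->route_dst, &rp->rcb_faddr,
                  sizeof(rp->rcb_faddr)) != 0;  /* destination changed */
}
```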

11.2.  Input  processing 

Input packets are “assigned” to raw sockets based on a simple pattern matching scheme.
Each network interface or protocol gives packets to the raw input routine with the call

raw_input(m, proto, src, dst)
struct mbuf *m; struct sockproto *proto; struct sockaddr *src, *dst;

The data packet then has a generic header prepended to it of the form

struct raw_header {
        struct  sockproto raw_proto;
        struct  sockaddr raw_dst;
        struct  sockaddr raw_src;
};

and it is placed in a packet queue for the “raw input protocol” module. Packets taken from this
queue are copied into any raw sockets that match the header according to the following rules,

1) The protocol family of the socket and header agree.

2) If the protocol number in the socket is non-zero, then it agrees with that found in the
packet header.

3) If a local address is defined for the socket, the address format of the local address is the
same as the destination address's and the two addresses agree bit for bit.

4)  The  rules  of  3)  are  applied  to  the  socket’s  foreign  address  and  the  packet's  source  address. 

A basic assumption is that addresses present in the control block and packet header (as
constructed by the network interface and any raw input protocol module) are in a canonical form
which may be “block compared”.
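The four matching rules translate almost directly into code. The sketch below is illustrative
only: the structures are simplified stand-ins for the control block and generic header, the flag
values are assumed, and the toy_ names are hypothetical. The "block compare" is done with memcmp
on the address bytes.

```c
#include <assert.h>
#include <string.h>

#define TOY_LADDR 0x1   /* local address present (assumed bit values) */
#define TOY_FADDR 0x2   /* foreign address present */

struct toy_sockaddr  { unsigned short sa_family; char sa_data[14]; };
struct toy_sockproto { unsigned short sp_family, sp_protocol; };

struct toy_header {
    struct toy_sockproto raw_proto;
    struct toy_sockaddr  raw_dst, raw_src;
};

struct toy_rawsock {
    struct toy_sockproto proto;
    struct toy_sockaddr  laddr, faddr;
    short                flags;
};

/* Same address format and identical bits ("block compare"). */
static int toy_addr_eq(const struct toy_sockaddr *a,
                       const struct toy_sockaddr *b)
{
    return a->sa_family == b->sa_family &&
           memcmp(a->sa_data, b->sa_data, sizeof(a->sa_data)) == 0;
}

/* Returns 1 if the packet described by h should be copied to socket s. */
int toy_raw_match(const struct toy_rawsock *s, const struct toy_header *h)
{
    assert(s != NULL && h != NULL);
    if (s->proto.sp_family != h->raw_proto.sp_family)
        return 0;                                   /* rule 1 */
    if (s->proto.sp_protocol != 0 &&
        s->proto.sp_protocol != h->raw_proto.sp_protocol)
        return 0;                                   /* rule 2 */
    if ((s->flags & TOY_LADDR) && !toy_addr_eq(&s->laddr, &h->raw_dst))
        return 0;                                   /* rule 3 */
    if ((s->flags & TOY_FADDR) && !toy_addr_eq(&s->faddr, &h->raw_src))
        return 0;                                   /* rule 4 */
    return 1;
}
```

A socket with protocol number zero and no addresses set acts as a wildcard within its protocol
family, which is why such a socket sees every packet the family delivers in this model.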

11.3.  Output  processing

On output the raw pr_usrreq routine passes the packet and raw control block to the raw
protocol output routine for any processing required before it is delivered to the appropriate
network interface. The output routine is normally the only code required to implement a raw socket
interface.


12.  Buffering  and  congestion  control 

One of the major factors in the performance of a protocol is the buffering policy used. Lack
of a proper buffering policy can force packets to be dropped, cause falsified windowing
information to be emitted by protocols, fragment host memory, degrade the overall host performance,
etc. Due to problems such as these, most systems allocate a fixed pool of memory to the
networking system and impose a policy optimized for “normal” network operation.

The networking system developed for UNIX is little different in this respect. At boot time a
fixed amount of memory is allocated by the networking system. At later times more system
memory may be requested as the need arises, but at no time is memory ever returned to the
system. It is possible to garbage collect memory from the network, but difficult. In order to
perform this garbage collection properly, some portion of the network will have to be “turned off”
as data structures are updated. The interval over which this occurs must be kept small compared to
the average inter-packet arrival time, or too much traffic may be lost, impacting other hosts on
the network, as well as increasing load on the interconnecting mediums. In our environment we have
not experienced a need for such compaction, and thus have left the problem unresolved.

The mbuf structure was introduced in chapter 5. In this section a brief description will be
given of the allocation mechanisms, and policies used by the protocols in performing connection
level buffering.

12.1.  Memory  management 

The basic memory allocation routines place no restrictions on the amount of space which
may be allocated. Any request made is filled until the system memory allocator starts refusing to
allocate additional memory. When the current quota of memory is insufficient to satisfy an mbuf
allocation request, the allocator requests enough new pages from the system to satisfy the current
request only. All memory owned by the network is described by a private page table used in
remapping pages to be logically contiguous as the need arises. In addition, an array of reference
counts parallels the page table and is used when multiple copies of a page are present.
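The "just enough pages for the current request" policy amounts to a rounding-up division. The
sketch below assumes the 128 byte mbuf and 1Kbyte page sizes used in this section; the pool
structure and the toy_mbuf_request name are hypothetical, and the case where the system allocator
refuses the request is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MSIZE    128                        /* bytes per mbuf */
#define TOY_PAGESIZE 1024                       /* bytes per page */
#define TOY_PERPAGE  (TOY_PAGESIZE / TOY_MSIZE) /* 8 mbufs per page */

struct toy_pool {
    int free_mbufs;     /* mbufs currently on the free list */
    int pages_owned;    /* pages obtained so far; never decreases */
};

/*
 * Satisfy a request for `want` mbufs, growing the pool by just enough
 * pages to cover the shortfall.  Returns the number of pages added.
 */
int toy_mbuf_request(struct toy_pool *p, int want)
{
    int added = 0;

    assert(p != NULL && want >= 0);
    if (want > p->free_mbufs) {
        int shortfall = want - p->free_mbufs;
        added = (shortfall + TOY_PERPAGE - 1) / TOY_PERPAGE; /* round up */
        p->pages_owned += added;
        p->free_mbufs += added * TOY_PERPAGE;
    }
    p->free_mbufs -= want;
    return added;
}
```

Because pages are never handed back, pages_owned records the high-water mark of network memory
use in this model.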

Mbufs are 128 byte structures, 8 fitting in a 1Kbyte page of memory. When data is placed
in mbufs, if possible, it is copied or remapped into logically contiguous pages of memory from the
network page pool. Data smaller than the size of a page is copied into one or more 112 byte mbuf
data areas.

12.2.  Protocol  buffering  policies

Protocols reserve fixed amounts of buffering for send and receive queues at socket creation
time. These amounts define the high and low water marks used by the socket routines in deciding
when to block and unblock a process. The reservation of space does not currently result in any
action by the memory management routines, though it is clear if one imposed an upper bound on
the total amount of physical memory allocated to the network, reserving memory would become
important.
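One plausible reading of the high and low water marks is a simple hysteresis: a writer is put to
sleep when a write would push the queue past the high mark, and is woken only after the queue
drains to the low mark. The sketch below models that reading; it is an assumption about the
policy, not a transcription of the socket routines, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sockbuf {
    int cc;       /* bytes currently queued */
    int hiwat;    /* high water mark: stop accepting data at this level */
    int lowat;    /* low water mark: resume once the queue drains to this */
    int blocked;  /* nonzero while the writer is asleep */
};

/* Writer side: queue n bytes, or block if that would pass the high mark. */
int toy_sbwrite(struct toy_sockbuf *sb, int n)
{
    assert(sb != NULL && n >= 0);
    if (sb->blocked || sb->cc + n > sb->hiwat) {
        sb->blocked = 1;
        return 0;               /* caller would sleep here */
    }
    sb->cc += n;
    return n;
}

/* Protocol side: n bytes left the queue; wake the writer at the low mark. */
void toy_sbdrain(struct toy_sockbuf *sb, int n)
{
    assert(sb != NULL && n >= 0 && n <= sb->cc);
    sb->cc -= n;
    if (sb->blocked && sb->cc <= sb->lowat)
        sb->blocked = 0;        /* wakeup */
}
```

Separating the two marks keeps a process from being woken for every small amount of space that
opens up in the queue.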

Protocols which provide connection level flow control do this based on the amount of space
in the associated socket queues. That is, send windows are calculated based on the amount of
free space in the socket's receive queue, while receive windows are adjusted based on the amount
of data awaiting transmission in the send queue. Care has been taken to avoid the “silly window
syndrome” described in [Clark82] at both the sending and receiving ends.
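The text does not spell out the rule used to avoid the silly window syndrome. A commonly
described receiver-side rule is to hold back a window update until the window can grow by a full
segment or by half the buffer, whichever is smaller. The sketch below implements that generic
rule as an illustration only; the function name, its parameters, and the threshold are
assumptions rather than the system's actual policy.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide what window to advertise.
 *   space   - free space now available in the receive queue
 *   hiwat   - capacity reserved for the receive queue
 *   mss     - maximum segment size
 *   offered - window still outstanding from the last advertisement
 *             (already reduced by data received since then)
 */
int toy_rcv_window(int space, int hiwat, int mss, int offered)
{
    int threshold;

    assert(hiwat > 0 && mss > 0);
    assert(offered >= 0 && offered <= space && space <= hiwat);
    threshold = (mss < hiwat / 2) ? mss : hiwat / 2;
    if (space - offered >= threshold)
        return space;       /* worthwhile update: open the window */
    return offered;         /* otherwise keep the old offer */
}
```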

12.3.  Queue  limiting

Incoming packets from the network are always received unless memory allocation fails.
However, each Level 1 protocol input queue has an upper bound on the queue's length, and any
packets exceeding that bound are discarded. It is possible for a host to be overwhelmed by
excessive network traffic (for instance a host acting as a gateway from a high bandwidth network
to a low bandwidth network). As a “defensive” mechanism the queue limits may be adjusted to
throttle network traffic load on a host. Consider a host willing to devote some percentage of its
machine to handling network traffic. If the cost of handling an incoming packet can be calculated
so that an acceptable “packet handling rate” can be determined, then input queue lengths may be
dynamically adjusted based on a host's network load and the number of packets awaiting
processing. Obviously, discarding packets is not a satisfactory solution to a problem such as this
(simply dropping packets is likely to increase the load on a network); the queue lengths were
incorporated mainly as a safeguard mechanism.
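A bounded input queue with a drop counter, plus one way of deriving the bound from a packet
handling budget, might look as follows. This is a sketch: the structure, the toy_ names, and in
particular the budget/cost formula in toy_set_limit are hypothetical illustrations of the idea,
not the system's code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_ifqueue {
    int len;      /* packets currently queued */
    int maxlen;   /* upper bound on queue length */
    int drops;    /* packets discarded because the queue was full */
};

/* Returns 1 if the packet was queued, 0 if it was discarded. */
int toy_enqueue(struct toy_ifqueue *q)
{
    assert(q != NULL);
    if (q->len >= q->maxlen) {
        q->drops++;
        return 0;
    }
    q->len++;
    return 1;
}

/*
 * Hypothetical throttle: given a per-packet handling cost and a time
 * budget for one service interval (same units), bound the queue by the
 * number of packets that can be handled within the budget.
 */
void toy_set_limit(struct toy_ifqueue *q, long budget, long cost_per_pkt)
{
    assert(q != NULL && budget >= 0 && cost_per_pkt > 0);
    q->maxlen = (int)(budget / cost_per_pkt);
    if (q->maxlen < 1)
        q->maxlen = 1;      /* always admit at least one packet */
}
```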

12.4.  Packet  forwarding 

When packets can not be forwarded because of memory limitations, the system generates a
“source quench” message. In addition, any other problems encountered during packet forwarding
are also reflected back to the sender in the form of ICMP packets. This helps hosts avoid
unneeded retransmissions.

Broadcast packets are never forwarded due to possible dire consequences. In an early stage
of network development, broadcast packets were forwarded and a “routing loop” resulted in
network saturation and every host on the network crashing.


13.  Out  of  band  data

Out of band data is a facility peculiar to the stream socket abstraction defined. Little
agreement appears to exist as to what its semantics should be. TCP defines the notion of “urgent
data” as in-line, while the NBS protocols [Burruss81] and numerous others provide a fully
independent logical transmission channel along which out of band data is to be sent. In addition,
the amount of the data which may be sent as an out of band message varies from protocol to
protocol; everything from 1 bit to 16 bytes or more.

A stream socket's notion of out of band data has been defined as the lowest reasonable
common denominator (at least reasonable in our minds); clearly this is subject to debate. Out of
band data is expected to be transmitted out of the normal sequencing and flow control constraints
of the data stream. A minimum of 1 byte of out of band data and one outstanding out of band
message are expected to be supported by the protocol supporting a stream socket. It is a
protocol's prerogative to support larger sized messages, or more than one outstanding out of band
message at a time.

Out of band data is maintained by the protocol and usually not stored in the socket's send
queue. The PRU_SENDOOB and PRU_RCVOOB requests to the pr_usrreq routine are used in
sending and receiving data.


14.  Trailer  protocols 

Core to core copies can be expensive. Consequently, a great deal of effort was spent in
minimizing such operations. The VAX architecture provides virtual memory hardware organized
in page units. To cut down on copy operations, data is kept in page sized units on page-aligned
boundaries whenever possible. This allows data to be moved in memory simply by remapping the
page instead of copying. The mbuf and network interface routines perform page table
manipulations where needed, hiding the complexities of the VAX virtual memory hardware from higher
level code.

Data enters the system in two ways: from the user, or from the network (hardware
interface). When data is copied from the user's address space into the system it is deposited in
pages (if sufficient data is present to fill an entire page). This encourages the user to transmit
information in messages which are a multiple of the system page size.

Unfortunately, performing a similar operation when taking data from the network is very
difficult. Consider the format of an incoming packet. A packet usually contains a local network
header followed by one or more headers used by the high level protocols. Finally, the data, if
any, follows these headers. Since the header information may be variable length, DMA'ing the
eventual data for the user into a page aligned area of memory is impossible without a priori
knowledge of the format (e.g. supporting only a single protocol header format).

To allow variable length header information to be present and still ensure page alignment of
data, a special local network encapsulation may be used. This encapsulation, termed a trailer
protocol, places the variable length header information after the data. A fixed size local network
header is then prepended to the resultant packet. The local network header contains the size of
the data portion, and a new trailer protocol header, inserted before the variable length
information, contains the size of the variable length header information. The following trailer
protocol header is used to store information regarding the variable length protocol header:

struct {
        short   protocol;       /* original protocol no. */
        short   length;         /* length of trailer */
};

The processing of the trailer protocol is very simple. On output, the local network header
indicates a trailer encapsulation is being used. The protocol identifier also includes an
indication of the number of data pages present (before the trailer protocol header). The trailer
protocol header is initialized to contain the actual protocol and variable length header size, and
appended to the data along with the variable length header information.

On input, the interface routines identify the trailer encapsulation by the protocol type
stored in the local network header, then calculate the number of pages of data to find the
beginning of the trailer. The trailing information is copied into a separate mbuf and linked to
the front of the resultant packet.
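The output and input steps can be sketched over a flat buffer. This is an illustrative model
under several assumptions: the local network header is reduced to a single type code, the code
TOY_TRAIL_BASE plus the page count is a made-up encoding, the length field is taken to hold the
size of the variable length header only, and all toy_ names are hypothetical. A real
implementation would link mbufs rather than return a pointer into the packet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE       1024     /* page size assumed for this sketch */
#define TOY_TRAIL_BASE 0x1000   /* hypothetical base for trailer type codes */

struct toy_trailer {
    short protocol;             /* original protocol no. */
    short length;               /* size of the variable length header */
};

/*
 * Build data | trailer header | variable header in `out` and return the
 * local network type code, which encodes the number of data pages.
 * Returns -1 if `out` is too small.
 */
int toy_trailer_encode(unsigned char *out, size_t outlen,
                       const unsigned char *data, int npages,
                       short protocol,
                       const unsigned char *hdr, short hdrlen)
{
    struct toy_trailer t;
    size_t datalen = (size_t)npages * TOY_PAGE;

    assert(out != NULL && data != NULL && hdr != NULL);
    assert(npages > 0 && hdrlen >= 0);
    if (outlen < datalen + sizeof(t) + (size_t)hdrlen)
        return -1;
    t.protocol = protocol;
    t.length = hdrlen;
    memcpy(out, data, datalen);                 /* page aligned data */
    memcpy(out + datalen, &t, sizeof(t));       /* trailer header */
    memcpy(out + datalen + sizeof(t), hdr, (size_t)hdrlen);
    return TOY_TRAIL_BASE + npages;
}

/*
 * Recover the pieces from a received packet.  Sets *hdr to the variable
 * header inside `pkt` and returns the original protocol number.
 */
short toy_trailer_decode(const unsigned char *pkt, int type,
                         const unsigned char **hdr, short *hdrlen)
{
    struct toy_trailer t;
    size_t datalen;

    assert(pkt != NULL && hdr != NULL && hdrlen != NULL);
    assert(type > TOY_TRAIL_BASE);
    datalen = (size_t)(type - TOY_TRAIL_BASE) * TOY_PAGE;
    memcpy(&t, pkt + datalen, sizeof(t));       /* locate the trailer */
    *hdr = pkt + datalen + sizeof(t);
    *hdrlen = t.length;
    return t.protocol;
}
```

Because the data occupies the front of the packet, a receiver that DMAs the packet to a
page-aligned buffer gets its data pages aligned without knowing the header format in advance.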

Clearly, trailer protocols require cooperation between source and destination. In addition,
they are normally cost effective only when sizable packets are used. The current scheme works
because the local network encapsulation header is a fixed size, allowing DMA operations to be
performed at a known offset from the first data page being received. Should the local network
header be variable length this scheme fails.

Statistics collected indicate as much as 200Kb/s can be gained by using a trailer protocol
with 1Kbyte packets. The average size of the variable length header was 40 bytes (the size of a